How do you filter out strangely formatted data from MSWord for a database?

Our programming team currently uses a database with Win1252 encoding, and the database is not very good at filtering out bad data natively.

Quite often the end users of our programs simply copy and paste their information from MSWord into our database, which leads to all kinds of funky characters appearing in the database that occasionally can't be interpreted.

Are there any libraries out there that would parse a string encoded with MSWord's native encoding and translate it to a similar ASCII, UTF-8, or Win1252 format?

By similar, I mean translating strange double quotes that look something like `“ ”` into the typical ".

Please inform me if my question is vague at all so I can update as necessary.

Answers


Check out Jeff Atwood's solution located here: http://www.codinghorror.com/blog/2006/01/cleaning-words-nasty-html.html

It uses regular expressions. FWIW, a lot of RTEs (rich text editors) out there use similar practices when cleaning copied and pasted content.
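For the plain-text side of the question (curly quotes and similar punctuation rather than Word's HTML), a small translation table may be enough. The following is a minimal sketch, not the linked article's code: the mapping and the names `WORD_PUNCTUATION`, `normalize_word_text`, and `to_cp1252` are hypothetical and would need extending for whatever characters actually show up in your data. It assumes the pasted text has already been decoded to a Unicode string.

```python
# Hypothetical mapping of typographic characters that Word commonly
# auto-inserts to plain ASCII equivalents. Extend as needed.
WORD_PUNCTUATION = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u00a0": " ",    # non-breaking space
}

# str.maketrans accepts a dict of single characters to replacement strings.
_TABLE = str.maketrans(WORD_PUNCTUATION)


def normalize_word_text(text: str) -> str:
    """Replace the mapped typographic characters with ASCII equivalents."""
    return text.translate(_TABLE)


def to_cp1252(text: str) -> bytes:
    """Normalize, then encode as Windows-1252.

    Characters that Windows-1252 cannot represent become '?'.
    """
    return normalize_word_text(text).encode("cp1252", errors="replace")


if __name__ == "__main__":
    pasted = "\u201cHello\u201d \u2013 it\u2019s Word\u2026"
    print(normalize_word_text(pasted))
```

Note that Windows-1252 itself contains the curly quotes (bytes 0x93 and 0x94), so if they appear as garbage in the database, the underlying cause may be that the text is being written or read with a different encoding somewhere along the way. Normalizing to ASCII punctuation sidesteps that, at the cost of losing the typographic characters.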

