How do you filter out strangely formatted data from MSWord for a database?
Our programming team currently uses a database with Win1252 (Windows-1252) encoding, and the database does little to filter out bad data on its own.
End users of our programs often copy and paste text from MSWord straight into our database, which leaves all kinds of funky characters in it that occasionally can't be interpreted.
Are there any libraries that would parse a string encoded with MSWord's native encoding and translate it to a similar ASCII, UTF-8, or Win1252 representation?
By similar, I mean translating strange double quotes that look something like “ ” into the typical ".
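To make "similar" concrete, here is a minimal Python sketch of the kind of mapping I have in mind. The names `WORD_TO_ASCII` and `normalize_word_punctuation` are hypothetical, and the character list is only an assumption about what Word typically inserts (curly quotes, dashes, ellipsis, non-breaking space), not an exhaustive set.

```python
# Hypothetical mapping from Word's "smart" punctuation to plain ASCII.
# The selection of characters is an assumption; extend it for your own data.
WORD_TO_ASCII = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u00a0": " ",    # non-breaking space
}

# str.maketrans accepts a dict of single characters to replacement strings.
_TABLE = str.maketrans(WORD_TO_ASCII)


def normalize_word_punctuation(text: str) -> str:
    """Replace common Word punctuation with plain ASCII equivalents."""
    return text.translate(_TABLE)
```

Anything not listed in the table passes through unchanged, so this would only cover the characters we already know about.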
Please let me know if my question is vague so I can update it as necessary.
Check out Jeff Atwood's solution here: http://www.codinghorror.com/blog/2006/01/cleaning-words-nasty-html.html
It uses regular expressions. FWIW, a lot of RTEs (rich text editors) out there use similar practices when cleaning copied-and-pasted content.
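As a rough illustration of the regex approach, here is a Python sketch that strips some Word-specific markup from pasted HTML. This is not the code from the linked article; the patterns and the function name `clean_word_html` are assumptions chosen for the example, and regexes like these can also hit ordinary text that happens to look like an attribute.

```python
import re

# Illustrative patterns only; these are assumptions, not the linked article's code.
_PATTERNS = [
    # HTML comments, including Word's conditional comments
    re.compile(r"<!--.*?-->", re.DOTALL),
    # Office namespace tags such as <o:p> and </o:p>
    re.compile(r"</?(?:o|w|v|st1):[^>]*>", re.IGNORECASE),
    # class, style and lang attributes (quoted or unquoted values)
    re.compile(
        r"""\s+(?:class|style|lang)=(?:"[^"]*"|'[^']*'|[^\s>]+)""",
        re.IGNORECASE,
    ),
]


def clean_word_html(html: str) -> str:
    """Apply each pattern in turn, deleting whatever it matches."""
    for pattern in _PATTERNS:
        html = pattern.sub("", html)
    return html
```

Run on `<p class="MsoNormal">Hello<o:p></o:p></p>`, this should leave `<p>Hello</p>`. For anything beyond light cleanup, a real HTML parser is probably the safer choice.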