How do you filter out strangely formatted data from MSWord for a database?

Our programming team currently uses a database with Win1252 (Windows-1252) encoding, and the database is not very good at filtering out bad data natively.

Quite often, the end users of our programs simply copy and paste their information from MSWord for insertion into our database, which leads to all kinds of funky characters that occasionally can't be interpreted.

Are there currently any libraries out there that would parse a string encoded with MSWord's native encoding and translate it to a similar ASCII, UTF-8, or Win1252 format?

By similar, I mean translating strange double quotes that look something like “ and ” into the typical ".

Please inform me if my question is vague at all so I can update as necessary.

Answers


Check out Jeff Atwood's solution located here: http://www.codinghorror.com/blog/2006/01/cleaning-words-nasty-html.html

It uses regular expressions. FWIW, a lot of RTEs (rich text editors) out there use similar practices when cleaning copied and pasted content.
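
For the quote translation the question asks about, something along these lines might be a starting point. This is only a rough Python sketch, not the code from the linked post: the name `clean_word_text`, the character table, and the `<o:p>`-style tag pattern are all my own assumptions, and the table is far from exhaustive. It strips Word's Office-namespace tags with a regular expression, maps the common "smart" punctuation to plain ASCII, and then replaces anything that still can't be stored as Win1252 with a question mark.

```python
import re

# Illustrative (not exhaustive) map from Word's typographic characters
# to plain ASCII equivalents. Extend it as new characters turn up.
SMART_PUNCTUATION = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u00a0": " ",    # non-breaking space
}
_TRANSLATION_TABLE = str.maketrans(SMART_PUNCTUATION)

# Assumed pattern: Office-namespace tags such as <o:p> and </o:p>
# that Word tends to leave behind in pasted HTML.
_OFFICE_TAGS = re.compile(r"</?o:[^>]*>", re.IGNORECASE)


def clean_word_text(text: str) -> str:
    """Hypothetical helper: normalise pasted Word text for a Win1252 column."""
    text = _OFFICE_TAGS.sub("", text)
    text = text.translate(_TRANSLATION_TABLE)
    # Characters that Windows-1252 cannot represent become "?".
    return text.encode("cp1252", errors="replace").decode("cp1252")


if __name__ == "__main__":
    pasted = "\u201cHello\u201d \u2014 it\u2019s<o:p></o:p>\u2026"
    print(clean_word_text(pasted))
```

Note that Win1252 can actually store curly quotes and dashes itself, so the ASCII mapping is only needed if you want the plain characters; the final encode step is what guards the database against characters it can't hold.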

