Parsing “&nbsp;” Characters as Spaces

I'm working on a project for a client where I need to parse data from their legacy HTML pages for more efficient storage. The data appears in this basic format, with multiple key/value pairs on a single line.

 Key1: Value1 Key2: Value2...

I'm able to get 95% of the records using preg_match_all('/\w+:\s+\S+/', $line, $items)

The problem I am having is a minority of the lines contain text like this:

 Key1: Value1&nbsp;Key2: Value2

In this case, my script shows that Value1 = Value1&nbsp;Key2:.

I've tried replacing the &nbsp; strings using both html_entity_decode($line) and str_replace('&nbsp;', ' ', $line). With both, I still have &nbsp; characters in the output, and the string isn't correctly parsed.

The pages I am trying to parse are WordPress pages. Inspecting the wp_posts record for the page shows that the &nbsp; strings are stored in the database. I believe the pages were populated via an export from MS Access. Earlier in my script, I've passed the parent of $line through strip_tags().

Is there any reliable way to eliminate/filter/replace this &nbsp; string?


I've been beating my head against the wall for days on this one, and finally found the answer. I tested every answer given by others. None work. -1 for everyone!

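The &nbsp; is being stored in the database as the Unicode no-break space character (U+00A0), not as the literal entity text. In UTF-8 that character is the two bytes 0xC2 0xA0, and it only shows as &nbsp; when rendered in the browser. This removes it: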

$line = str_replace("\xC2\xA0", " ", $line);
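
To put that replacement together with the pattern from the question, here is a minimal sketch. The sample line is made up for illustration, and the capture groups and the $pairs array are my additions, not part of the original script. It also assumes the data is UTF-8; in another encoding the no-break space would be a different byte sequence.

```php
<?php
// Hypothetical sample line: the bytes \xC2\xA0 are a UTF-8 no-break space (U+00A0).
$line = "Key1: Value1\xC2\xA0Key2: Value2";

// Dump the raw bytes to confirm what is actually stored; look for "c2a0".
echo bin2hex($line), "\n";

// Replace the UTF-8 no-break space, and any literal &nbsp; entity, with a plain space.
$clean = str_replace(["\xC2\xA0", '&nbsp;'], ' ', $line);

// Same pattern as in the question, with capture groups added for key and value.
preg_match_all('/(\w+):\s+(\S+)/', $clean, $matches, PREG_SET_ORDER);

// Collect the matches into a key => value array.
$pairs = [];
foreach ($matches as $m) {
    $pairs[$m[1]] = $m[2];
}

print_r($pairs);
```

The bin2hex() call is only there for diagnosis. If it shows c2a0 where you expected a space, that would explain why html_entity_decode() and a str_replace() on the entity text had no effect: there is no entity in the string to decode.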
