Parsing “&nbsp;” Characters as Spaces
I'm working on a project for a client where I need to parse data from their legacy HTML pages for more efficient storage. The data appears in this basic format, with multiple key/value pairs on a single line.
Key1: Value1 Key2: Value2...
I'm able to get 95% of the records using preg_match_all('/\w+:\s+\S+/', $line, $items)
The problem I am having is a minority of the lines contain text like this:
Key1: Value1&nbsp;Key2: Value2
In this case, my script shows that Value1 = Value1&nbsp;Key2:.
I've tried replacing the &nbsp; strings using both html_entity_decode($line) and str_replace('&nbsp;', ' ', $line). With both, I still have &nbsp; characters in the output, and the string isn't correctly parsed.
The pages I am trying to parse are WordPress pages. Inspecting the wp_post record for the page shows that the &nbsp; strings are stored in the database. I believe the pages were populated via an export from MS Access. Earlier in my script, I've passed the parent of $line through strip_tags().
Is there any reliable way to eliminate/filter/replace this string?
I've been beating my head against the wall for days on this one, and finally found the answer. I tested every answer given by others. None work. -1 for everyone!
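The &nbsp; is being stored in the database as the Unicode no-break space character (U+00A0), which is the two bytes 0xC2 0xA0 in UTF-8. It only shows as &nbsp; when rendered in the browser, so replacing the literal string '&nbsp;' finds nothing. This replaces the character with a plain space: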
$line = str_replace("\xC2\xA0", " ", $line);
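For anyone who wants to see the fix together with the regex from the question, here is a minimal sketch. The function name parse_pairs() and the sample line are made up for illustration, and it assumes the input is UTF-8. It also decodes any literal &nbsp; entities first, in case some rows contain the entity rather than the raw character.

```php
<?php
// Hypothetical helper (not from the original script): normalizes
// no-break spaces, then applies the key/value regex from the question.
function parse_pairs(string $line): array
{
    // Turn any literal "&nbsp;" entities into the U+00A0 character.
    $line = html_entity_decode($line, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // "\xC2\xA0" is U+00A0 encoded as UTF-8; swap it for a plain space.
    $line = str_replace("\xC2\xA0", ' ', $line);

    preg_match_all('/\w+:\s+\S+/', $line, $items);
    return $items[0];
}

// Sample line with a raw no-break space between the two pairs.
$pairs = parse_pairs("Key1: Value1\xC2\xA0Key2: Value2");
print_r($pairs);
```

Note the double quotes around "\xC2\xA0": PHP only interprets \x escapes in double-quoted strings, so single quotes would search for the literal backslash sequence instead.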