Issues decoding strings from Xml

I have been given a large number of XML files from which I need to pull out parts of the text elements and reuse them for other purposes. (I am using XDocument to read the XML data.)

But, how do I decode the text contained in the elements? What is even the formatting used here? A few examples:

"What is the meaning of this® asks Sonny."
"The big centre cost 1&amp;#190; million pounds"
"... lost it. ® The next ..."

I have tried HttpUtility.HtmlDecode, but that did not do the trick. If I decode twice, the "&amp;#174;" turns into ®, which is obviously not right.

Looks like ® are line breaks. The ® are probably question marks. The 190 one, I don't even know. Perhaps a dot or comma?

Any ideas would be welcome.

Answers


It does appear that the strings you show have been HTML encoded, and then XML encoded (or HTML again).

It is correct that &amp;#174; -> &#174; -> ® (the registered trademark symbol) per the ISO Latin-1 entities; &amp;reg; should behave the same way.

Similarly, &amp;#190; would turn into ¾, the fraction three quarters.
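
To illustrate the two layers of encoding, here is a minimal sketch, assuming the text is read through XDocument and that System.Net.WebUtility is available. The DecodeFully helper, the sample element and the pass limit are made up for the example; they are not taken from the question.

```csharp
using System;
using System.Net;
using System.Xml.Linq;

static class EntityDemo
{
    // Hypothetical helper: applies HtmlDecode until the text stops changing,
    // with a small pass limit as a safety net.
    static string DecodeFully(string text, int maxPasses = 5)
    {
        for (int i = 0; i < maxPasses; i++)
        {
            string decoded = WebUtility.HtmlDecode(text);
            if (decoded == text)
            {
                break; // nothing left to decode
            }
            text = decoded;
        }
        return text;
    }

    static void Main()
    {
        // Assumed sample: the entity is stored as "&amp;#190;" in the raw XML.
        var doc = XDocument.Parse(
            "<text>The big centre cost 1&amp;#190; million pounds</text>");

        // The XML parser removes the outer layer, leaving "&#190;" in the value.
        string raw = doc.Root.Value;

        // HtmlDecode removes the inner layer, leaving the character U+00BE.
        string clean = DecodeFully(raw);
        Console.WriteLine(clean);
    }
}
```

Decoding in a loop handles text that was encoded an unknown number of times, but it will also decode any text that was meant to contain a literal "&amp;", so a fixed number of passes may be safer if the data is consistent. Either way, the result for 174 is ®, because that is what the data actually encodes; if the source system meant something else by it, that mapping has to come from whoever produced the files.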

