C# - Best Approach to Parsing Webpage?
I've saved an entire webpage's html to a string, and now I want to grab the "href" values from the links, preferably with the ability to save them to different strings later. What's the best way to do this?
I've tried saving the string as an .xml doc and parsing it using an XPathDocument navigator, but (surprise surprise) it doesn't navigate a not-really-an-xml-document too well.
Are regular expressions the best way to achieve what I'm trying to accomplish?
Regular expressions are one way to do it, but it can be problematic.
Most HTML pages can't be parsed using standard html techniques because, as you've found out, most don't validate.
You could spend the time trying to integrate HTML Tidy or a similar tool, but it would be much faster to just build the regex you need.
At the time of this update I've received 15 up and 9 downvotes. I think that maybe people aren't reading the question nor the comments on this answer. All the OP wanted to do was grab the href values. That's it. From that perspective, a simple regex is just fine. If the author had wanted to parse other items then there is no way I would recommend regex as I stated at the beginning, it's problematic at best.
I can recommend the HTML Agility Pack. I've used it in a few cases where I needed to parse HTML and it works great. Once you load your HTML into it, you can use XPath expressions to query the document and get your anchor tags (as well as just about anything else in there).
HtmlDocument yourDoc = // load your HTML; int someCount = yourDoc.DocumentNode.SelectNodes("your_xpath").Count;