C# - Best Approach to Parsing Webpage?

I've saved an entire webpage's html to a string, and now I want to grab the "href" values from the links, preferably with the ability to save them to different strings later. What's the best way to do this?

I've tried saving the string as an .xml doc and parsing it using an XPathDocument navigator, but (surprise surprise) it doesn't navigate a not-really-an-xml-document too well.

Are regular expressions the best way to achieve what I'm trying to accomplish?

Answers


Regular expressions are one way to do it, but it can be problematic.

Most HTML pages can't be parsed using standard html techniques because, as you've found out, most don't validate.

You could spend the time trying to integrate HTML Tidy or a similar tool, but it would be much faster to just build the regex you need.

UPDATE

At the time of this update I've received 15 up and 9 downvotes. I think that maybe people aren't reading the question nor the comments on this answer. All the OP wanted to do was grab the href values. That's it. From that perspective, a simple regex is just fine. If the author had wanted to parse other items then there is no way I would recommend regex as I stated at the beginning, it's problematic at best.


I can recommend the HTML Agility Pack. I've used it in a few cases where I needed to parse HTML and it works great. Once you load your HTML into it, you can use XPath expressions to query the document and get your anchor tags (as well as just about anything else in there).

HtmlDocument yourDoc = // load your HTML;
int someCount = yourDoc.DocumentNode.SelectNodes("your_xpath").Count;

Need Your Help

Rewrite is not working well

regex apache .htaccess mod-rewrite url-rewriting

I wanna change this url: http://domain.nl/template/houses

How to get changes realtime with jquery in Django

jquery ajax django

I have made a little app that you can add people to and lists them all on a page. I used REST to produce a JSON api and jquery to get the results. How do I use JQuery to update any changes to the p...

About UNIX Resources Network

Original, collect and organize Developers related documents, information and materials, contains jQuery, Html, CSS, MySQL, .NET, ASP.NET, SQL, objective-c, iPhone, Ruby on Rails, C, SQL Server, Ruby, Arrays, Regex, ASP.NET MVC, WPF, XML, Ajax, DataBase, and so on.