Parsing an HREF from an HTML string using a regular expression

I need to parse a link to a zip file out of HTML. The name of this zip file changes every month. Here is a snippet of the HTML I need to parse:

<a href="http://nppes.viva-it.com/NPPES_Data_Dissemination_Mar_2012.zip">

The string I need to get is "http://nppes.viva-it.com/NPPES_Data_Dissemination_Mar_2012.zip" so I can download the file using WebClient. The only portion of that zip file URL that remains constant from month to month is "http://nppes.viva-it.com/". Is there a way using a regular expression to parse the full URL, "http://nppes.viva-it.com/NPPES_Data_Dissemination_Mar_2012.zip", out of the HTML?

Answers


If there will only ever be one ZIP linked to on the page, no problem:

Regex re = new Regex(@"http://nppes\.viva-it\.com/[^""'\s>]+\.zip");

string url = re.Match(html).Value; // the matched URL, or "" if nothing matched

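If you would rather tie the match to the href attribute itself and get the URL from a capture group, something along these lines might do. This is only a sketch: ZipLinkFinder and FindZipUrl are names made up for the example, and it assumes the attribute value is always wrapped in single or double quotes, as in the snippet from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ZipLinkFinder
{
    // Hypothetical helper pattern: href="..." or href='...' whose value starts
    // with the constant base URL and ends in .zip. Group 1 is the opening quote,
    // and \1 requires the same quote character to close the value.
    private static readonly Regex ZipHref = new Regex(
        @"href\s*=\s*([""'])(?<url>http://nppes\.viva-it\.com/[^""'\s>]+\.zip)\1",
        RegexOptions.IgnoreCase);

    // Returns the first matching URL, or null when the page has no such link.
    public static string FindZipUrl(string html)
    {
        Match m = ZipHref.Match(html);
        return m.Success ? m.Groups["url"].Value : null;
    }
}
```

The result of ZipLinkFinder.FindZipUrl(html) can then be handed to WebClient.DownloadFile together with whatever local file name you choose, after checking it for null.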


By using HtmlAgilityPack:

using HtmlAgilityPack;

var html = "<a href=\"http://nppes.viva-it.com/NPPES_Data_Dissemination_Mar_2012.zip\">";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var anchor = doc.DocumentNode.SelectSingleNode("//a"); // first <a> element; null if there is none
var href = anchor.GetAttributeValue("href", null);     // null if the anchor has no href attribute

Now the href variable holds "http://nppes.viva-it.com/NPPES_Data_Dissemination_Mar_2012.zip".

Isn't that simpler than a regex?
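A real page will usually contain many anchors, so "//a" on its own would pick up the first link rather than the zip. One way to narrow it down, sketched here, is to filter on the constant part of the URL in XPath and check the extension in C#. AnchorZipFinder and FindZipUrl are invented names, and the sketch assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class AnchorZipFinder
{
    // Constant prefix taken from the question.
    private const string BaseUrl = "http://nppes.viva-it.com/";

    // Returns the first href that starts with BaseUrl and ends in .zip, or null.
    public static string FindZipUrl(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes(
            "//a[starts-with(@href, '" + BaseUrl + "')]");
        if (anchors == null) return null;

        return anchors
            .Select(a => a.GetAttributeValue("href", ""))
            .FirstOrDefault(h => h.EndsWith(".zip", StringComparison.OrdinalIgnoreCase));
    }
}
```

The extension check is done in C# rather than XPath because ends-with() belongs to XPath 2.0, which the XPath 1.0 engine behind SelectNodes does not provide.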


Here is a raw regex, written in free-spacing mode, that uses branch reset. The answer is in capture buffer 2.

<a 
  (?=\s) 
  (?= (?:[^>"']|"[^"]*"|'[^']*')*? (?<=\s)
    href \s*=
    (?|
        (?> \s* (['"]) \s* (http://nppes\.viva-it\.com/ (?:(?!\g{-2}) .)+ \.zip ) \s*     \g{-2} )
      | (?> (?!\s*['"]) \s* () (http://nppes\.viva-it\.com/ [^\s>]* \.zip ) (?=\s|>) )
    )
  )
  \s+ (?:".*?"|'.*?'|[^>]*?)+ 
>

Not sure if C# can do branch reset. If it can't, this variation avoids it. Only one of the two alternatives takes part in a match, so the answer is always capture buffer 2 concatenated with capture buffer 3 (one of them will be empty).

<a 
  (?=\s) 
  (?= (?:[^>"']|"[^"]*"|'[^']*')*? (?<=\s)
    href \s*=
    (?:
        (?> \s* (['"]) \s* (http://nppes\.viva-it\.com/ (?:(?!\g{-2}) .)+ \.zip ) \s* \g{-2} )
      | (?> (?!\s*['"]) \s* (http://nppes\.viva-it\.com/ [^\s>]* \.zip ) (?=\s|>) )
    )
  )
  \s+ (?:".*?"|'.*?'|[^>]*?)+ 
>
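As far as I know, .NET's regex engine supports neither branch reset, (?|...), nor relative backreferences such as \g{-2}, so neither pattern above is likely to compile unchanged in C#. Below is a rough .NET-flavored adaptation, not a faithful port: it reuses one group name in both alternatives, which .NET allows and which gives much the same effect as branch reset. The class, method, and group names (HrefZipRegex, Extract, q, url) are assumptions for this sketch, and the tag-level lookahead of the original is simplified to a single pass over the attributes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrefZipRegex
{
    // Both alternatives capture into the same named group, url, so the caller
    // reads one group regardless of which alternative matched.
    private static readonly Regex Pattern = new Regex(@"
        <a \s
        (?: [^>""'] | ""[^""]*"" | '[^']*' )*?      # lazily skip earlier attributes
        (?<=\s) href \s* = \s*
        (?:
            # quoted value: remember the quote in q and require the same one to close
            (?<q>[""']) \s*
            (?<url> http://nppes\.viva-it\.com/ (?:(?!\k<q>) .)+? \.zip )
            \s* \k<q>
          |
            # unquoted value: runs up to whitespace or the end of the tag
            (?<url> http://nppes\.viva-it\.com/ [^\s>""']* \.zip ) (?=\s|>)
        )",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);

    // Returns the captured URL, or null when no matching anchor is found.
    public static string Extract(string html)
    {
        Match m = Pattern.Match(html);
        return m.Success ? m.Groups["url"].Value : null;
    }
}
```

With this version there is nothing to concatenate: HrefZipRegex.Extract(html) reads the url group directly, whether the attribute was quoted or not.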
