Parsing the URL and link text out of HTML <a></a> tags in C

My application receives, as part of its data, a large HTML-formatted file that contains a large number of links. It looks like what you would get from a search on Google, Yahoo, or another search engine: a list of URLs, each with a description or other text.

I've been trying to come up with a function that can parse out each URL and its description and save them to a text file, but it has proven hard, at least for me. So, if I have:

<a href="http://www.w3schools.com">Visit W3Schools</a>

I would extract http://www.w3schools.com and Visit W3Schools and save them in a file.

Is there any way to achieve this in plain C? Any help is appreciated.

Answers


You really need a proper HTML parser, but for something quick and dirty, try:

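#include <stdbool.h>
#include <string.h>

/* Finds the next <a ... href="...">text</a> in *data. On success, *url and
   *desc point into the (modified) input and *data is moved past "</a>". */
bool get_url(char **data, char **url, char **desc)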
{
  bool result = false;
  char *ptr = strstr(*data, "<a");

  if(NULL != ptr)
  {
    *data = ptr + 2;

    ptr = strstr(*data, "href=\"");
    if(NULL != ptr)
    {
      *data = ptr + 6;
      *url = *data;

      ptr = strchr(*data, '"');
      if(NULL != ptr)
      {
        *ptr = '\0';
        *data = ptr + 1;

        ptr = strchr(*data, '>');
        if(NULL != ptr)
        {
          *data = ptr + 1;
          *desc = *data;

          ptr = strstr(*data, "</a>");
          if(NULL != ptr)
          {
            *ptr = '\0';
            *data = ptr + 4;
            result = true;
          }
        }
      }
    }
  }

  return result;
}

Note that data gets updated to point beyond the data parsed (it's an in-out parameter) and that the string passed in gets modified. I'm feeling lazy/too busy to do a full solution that returns newly allocated strings.
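For reference, here is a rough sketch of what a version with allocated return strings might look like. It does the same scan but never writes to the input, so the input can be const. The names get_url_copy and copy_range are made up for this illustration and are not part of the function above. Unlike get_url, this sketch leaves *data untouched when no complete link is found.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy the len bytes starting at start into a new
   NUL-terminated string. Returns NULL if malloc fails. */
static char *copy_range(const char *start, size_t len)
{
  char *out = malloc(len + 1);
  if(NULL != out)
  {
    memcpy(out, start, len);
    out[len] = '\0';
  }
  return out;
}

/* Same scan as get_url, but the input is never modified. On success, *url
   and *desc are heap strings the caller must free(), and *data points just
   past the "</a>". On failure nothing is changed. */
bool get_url_copy(const char **data, char **url, char **desc)
{
  assert(NULL != data && NULL != *data);

  const char *a = strstr(*data, "<a");
  if(NULL == a) return false;

  const char *href = strstr(a + 2, "href=\"");
  if(NULL == href) return false;
  const char *url_start = href + 6;

  const char *url_end = strchr(url_start, '"');
  if(NULL == url_end) return false;

  const char *gt = strchr(url_end + 1, '>');
  if(NULL == gt) return false;
  const char *desc_start = gt + 1;

  const char *desc_end = strstr(desc_start, "</a>");
  if(NULL == desc_end) return false;

  char *u = copy_range(url_start, (size_t)(url_end - url_start));
  char *d = copy_range(desc_start, (size_t)(desc_end - desc_start));
  if(NULL == u || NULL == d)
  {
    free(u);   /* free(NULL) is a no-op, so this is safe either way */
    free(d);
    return false;
  }

  *url = u;
  *desc = d;
  *data = desc_end + 4;
  return true;
}
```

The price is two malloc calls per link and the caller having to free both strings, which is why the in-place version is shorter.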

Also, you probably ought to return errors at the cascade of closing braces (except the first one), which is partly why I nested them like that. There are other, neater solutions that can be adapted to be more generic.

So basically you then call the function repeatedly until it returns false.
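A minimal sketch of that calling loop, writing each pair to a file as the question asks. The driver name save_links and the tab-separated output format are assumptions made for this example. get_url is repeated here in condensed early-return form only so the sketch compiles on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Condensed restatement of get_url from the answer, using early returns. */
static bool get_url(char **data, char **url, char **desc)
{
  char *ptr = strstr(*data, "<a");
  if(NULL == ptr) return false;
  *data = ptr + 2;

  ptr = strstr(*data, "href=\"");
  if(NULL == ptr) return false;
  *url = *data = ptr + 6;

  ptr = strchr(*data, '"');
  if(NULL == ptr) return false;
  *ptr = '\0';                 /* terminate the URL in place */
  *data = ptr + 1;

  ptr = strchr(*data, '>');
  if(NULL == ptr) return false;
  *desc = *data = ptr + 1;

  ptr = strstr(*data, "</a>");
  if(NULL == ptr) return false;
  *ptr = '\0';                 /* terminate the description in place */
  *data = ptr + 4;
  return true;
}

/* Hypothetical driver: writes one "url<TAB>description" line per link to
   out and returns the number of links written. html must be a writable
   buffer (not a string literal) because get_url modifies it. */
int save_links(char *html, FILE *out)
{
  char *cursor = html;
  char *url = NULL;
  char *desc = NULL;
  int count = 0;

  assert(NULL != html && NULL != out);

  while(get_url(&cursor, &url, &desc))
  {
    fprintf(out, "%s\t%s\n", url, desc);
    count++;
  }
  return count;
}
```

In a real program you would read the HTML file into a malloc'd, NUL-terminated buffer, open the output file with fopen, and pass both to save_links.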

