Parsing a large HTML file (local) with Perl or PHP

I have a large document. I need to parse it and print only this part: schule.php?schulnr=80287&lschb=

How do I parse it?

    <A HREF="schule.php?schulnr=80287&lschb=" target="_blank">
        <center><img border=0 height=16 width=15 src="sh_info.gif"></center>

Love to hear from you


You could also do it this way (it's not Perl, but it's more "visual"):

  • Load the document into your browser, if possible
  • Install Firebug extension/add-on
  • Install FirePath extension
  • Copy and paste this XPath expression into the text field labeled "XPath:"

    //a[contains(@href, "schule")]/@href

  • Click "Eval" button.

There are also tools to do this on the command line, e.g. "xmllint" (on Unix):

    xmllint --html --xpath '//a[contains(@href, "schule")]/@href' myfile.php.or.html

You could do further processing from there.
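
As one sketch of that further processing, the Perl snippet below pulls the bare URLs out of the text xmllint prints. It assumes each attribute is printed as href="..." with & escaped as &amp;, which may vary between libxml2 versions. extract_hrefs is a hypothetical helper name, not part of any library.

```perl
use strict;
use warnings;

# Hypothetical helper: take the text printed by the xmllint command above
# and return the bare attribute values.
# Assumes each match is printed as  href="..."  with & escaped as &amp;.
sub extract_hrefs {
    my ($xmllint_output) = @_;
    my @hrefs = $xmllint_output =~ /href="([^"]*)"/g;
    s/&amp;/&/g for @hrefs;    # undo the XML escaping of ampersands
    return @hrefs;
}

# Sample text standing in for real xmllint output.
my $sample = ' href="schule.php?schulnr=80287&amp;lschb="';
print "$_\n" for extract_hrefs($sample);
```

With real data you would read the xmllint output from a pipe or a file and pass it in instead of the sample string.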

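You ought to use a DOM parser like PHP Simple HTML DOM Parser:

    // Load the library file that ships with PHP Simple HTML DOM Parser
    include 'simple_html_dom.php';

    // Create DOM from URL or file
    $html = file_get_html(''); // the path or URL of your document goes between the quotes

    // Find all links
    foreach ($html->find('a') as $element) {
        echo $element->href . '<br>';
    }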

In Perl, the quickest and best way I know to scan HTML is HTML::PullParser. It is based on a robust HTML parser, not a simple FSA like Perl regex (without recursion).

This is more like a SAX filter than a DOM.

    use 5.010;
    use constant NOT_FOUND => -1;
    use strict;
    use warnings;

    use English qw<$OS_ERROR>;
    use HTML::PullParser ();

    my $pp
        = HTML::PullParser->new(
          # your file or even a handle
          file        => 'my.html'
          # specifies that you want a tuple of tagname, attribute hash
        , start       => 'tag, attr'
          # you only want to look at tags with tagname = 'a'
        , report_tags => [ 'a' ]
        )
        or die "$OS_ERROR";

    my $anchor_url;
    while ( defined( my $t = $pp->get_token )) {
        next unless ref $t and $t->[0] eq 'a'; # this shouldn't happen, really
        my $href = $t->[1]->{href} // next;    # skip anchors without an href
        if ( index( $href, 'schule.php?' ) > NOT_FOUND ) {
            $anchor_url = $href;
            last;                              # stop at the first match
        }
    }
    say $anchor_url if defined $anchor_url;    # print the link, if one was found

What Rfvgyhn said, but in Perl flavor, since that was one of the tags: use HTML::TreeBuilder.
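
A minimal sketch of that approach, assuming HTML::TreeBuilder is installed from CPAN. schule_links is a hypothetical helper name, and the inline HTML only stands in for your document. For a real file you would build the tree with new_from_file instead of new_from_content.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the href of every <a> element whose
# href contains 'schule.php?'.
sub schule_links {
    my ($html) = @_;
    my $tree  = HTML::TreeBuilder->new_from_content($html);
    my @hrefs = map { $_->attr('href') }
        $tree->look_down( _tag => 'a', href => qr/schule\.php\?/ );
    $tree->delete;    # release the tree's memory explicitly
    return @hrefs;
}

# Inline sample standing in for the real document.
my $sample = '<A HREF="schule.php?schulnr=80287&lschb=" target="_blank">info</A>'
           . '<a href="other.php">other</a>';
print "$_\n" for schule_links($sample);
```

Unlike the pull parser, this builds the whole tree in memory, which may matter for a very large file.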

Also, for the reasons why regex is almost never a good idea for parsing XML/HTML (sometimes it's Good Enough With Major Caveats), read the obligatory and infamous StackOverflow post:

RegEx match open tags except XHTML self-contained tags

Mind you, if the full extent of your task is literally "parse out HREF links", AND you don't have "<link>" tags, AND the HREF="something" substrings are guaranteed not to appear in any other context (e.g. in comments, as text, or with "HREF=" inside the link itself), it just might fall into the "Good Enough" category above for regex usage:

    my @lines = <>; # Replace with proper method of reading in your file
    my @hrefs = map { $_ =~ /href="([^"]+)"/gi; } @lines;
