How can I use PHP's preg_match_all to differentiate anchor elements by an attribute of their inner HTML element?

I have sets of HTML anchor elements enclosing image elements. For each set, using PHP-CLI, I want to pull the URLs and classify them according to their types. The type of an anchor can only be determined by an attribute of its child image element. It would be easy if there were only one of each type per set. The problem arises when two anchor elements of one type are separated by one or more anchors of the other types. My non-greedy parenthesized sub-pattern seems to become greedy and expands to find the second relevant child attribute. In my test script I'm trying to pull the 'Userlink' URLs from amongst the other types, using a simple pattern like:

#<a href="(.*?)" custattr="value1"><img alt="Userlink"#

On a set like:

<li><a href="http://www.userlink1.com/my/page.html" custattr="value1"><img alt="Userlink" class="common_link_class" height="123" src="pic0.png" width="123" style="width: 123px;"></a></li><li><a href="http://www.socnet1.com/username1" custattr="value1"><img alt="Socnet1" class="common_link_class" height="123" src="pic1.png" width="123" style="width: 123px;"></a></li><li><a href="http://www.socnet2.com/username1" custattr="value1"><img alt="Socnet2" class="common_link_class" height="123" src="pic2.png" width="123" style="width: 123px;"></a></li><li><a href="mailto:useralias1@unlikely.zyx321.usermail.net" custattr="value1"><img alt="Usermail" class="common_link_class" height="123" src="pic3.png" width="123" style="width: 123px;"></a></li><li><a href="http://www.userlink2.com/my/page.html" custattr="value1"><img alt="Userlink" class="common_link_class" height="123" src="pic4.png" width="123" style="width: 123px;"></a></li>

(Sorry, but the actual HTML is on one line like that.)

My sub-pattern captures from the beginning of the first "Userlink" URL to the end of the last one.

I've tried many variations of look-aheads, but I'm not sure I should list them all here. So far they've either returned no match at all or the same result as described above.

Here's my test script (running in a Bash shell):

#!/usr/bin/php
<?php
    $lines = 0;
    $input = "";
    $matches = array();

    while (($line = fgets(STDIN)) !== false){
        $input .= $line;
        $lines++;
    }
    fwrite(STDERR, "Processing $lines\n");

    $pcre = '#<a href="(.*?)" custattr="value1"><img alt="Userlink"#';

    if (preg_match_all($pcre,$input,$matches)){
        fwrite(STDERR, "\$matches has " . count($matches) . " elements\n");
        foreach ($matches[1] as $match){
            fwrite(STDOUT, $match . "\n");
        }
    }
?>

What PCRE pattern for PHP's preg_match_all() would return the two "Userlink" URLs in the above example?

Answers


I have taken the liberty of changing your variable names:

$pattern = '~<a href="([^"]++)" custattr="value1"><img alt="Userlink"~';

if ($nb = preg_match_all($pattern, $input, $matches)) {
    fwrite(STDERR, "\$matches has " . $nb . " elements\n");
    fwrite(STDOUT, implode("\n", $matches[1]) . "\n");
}

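Note that preg_match_all() returns the number of full pattern matches (or false on error), so its return value can be used directly. count($matches), by contrast, counts the sub-arrays of $matches (one for the full match plus one per capture group), not the matches themselves.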
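
For context, the lazy quantifier does not really become greedy. The engine tries the leftmost possible starting position first, so a match can begin at the `<a href="` of an earlier anchor, and `.*?` then grows one character at a time until the rest of the pattern matches, which may only happen at a later "Userlink" image. The negated class `[^"]` cannot cross the closing quote of the href, so that attempt fails and the engine moves on to the next anchor. A minimal sketch of the difference follows; the shortened `$html` sample and its example.com / example.org URLs are made up for illustration and are not the asker's data:

```php
<?php
// Hypothetical, shortened sample: a Socnet1 anchor precedes the Userlink anchor.
$html = '<li><a href="http://example.com/a" custattr="value1"><img alt="Socnet1"></a></li>'
      . '<li><a href="http://example.org/b" custattr="value1"><img alt="Userlink"></a></li>';

// Lazy dot: the match starts at the first <a href=" and grows until "Userlink" is found.
$lazy = '~<a href="(.*?)" custattr="value1"><img alt="Userlink"~';
// Negated class: the capture cannot run past the closing quote of the href.
$strict = '~<a href="([^"]++)" custattr="value1"><img alt="Userlink"~';

preg_match_all($lazy, $html, $lazyMatches);
preg_match_all($strict, $html, $strictMatches);

// The lazy capture begins with the Socnet1 URL and runs through the Userlink URL.
echo $lazyMatches[1][0], "\n";
// The strict capture list holds only the Userlink URL.
echo implode("\n", $strictMatches[1]), "\n";
```

The possessive `++` is an optimization rather than a requirement; a plain `[^"]+` would be expected to give the same captures here.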


This regex should work -

<a href="([^"]*?)"[^>]*\><img alt="Userlink"

You can see how it works here.

Testing it -

$pcre = '/<a href="([^"]*?)"[^>]*\><img alt="Userlink"/';
if (preg_match_all($pcre,$input,$matches)){
    var_dump($matches);
    //$matches[1] will be the array containing the urls.
}
/*
    OUTPUT- 
    array
      0 => 
        array
          0 => string '<a href="http://www.userlink1.com/my/page.html" custattr="value1"><img alt="Userlink"' (length=85)
          1 => string '<a href="http://www.userlink2.com/my/page.html" custattr="value1"><img alt="Userlink"' (length=85)
      1 => 
        array
          0 => string 'http://www.userlink1.com/my/page.html' (length=37)
          1 => string 'http://www.userlink2.com/my/page.html' (length=37)
*/
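
Since the question ultimately asks to classify every URL by its type, one possible variation is to capture the img alt value as well and group the URLs by it in a single pass. This is only a sketch: the function name `classifyLinks` is hypothetical, the sample is a shortened excerpt of the question's input, and the pattern assumes the markup keeps the attribute order shown in the question (href first on the anchor, alt first on the image that directly follows).

```php
<?php
/**
 * Group anchor URLs by the alt text of the image they enclose.
 * Hypothetical helper; assumes href is the first attribute of <a>
 * and alt is the first attribute of the <img> directly after it.
 */
function classifyLinks(string $html): array
{
    $pattern = '~<a href="([^"]+)"[^>]*><img alt="([^"]+)"~';
    $byType = [];
    // PREG_SET_ORDER yields one array per anchor: [full match, href, alt].
    if (preg_match_all($pattern, $html, $sets, PREG_SET_ORDER)) {
        foreach ($sets as $set) {
            $byType[$set[2]][] = $set[1];
        }
    }
    return $byType;
}

// Shortened excerpt of the input from the question.
$sample = '<li><a href="http://www.userlink1.com/my/page.html" custattr="value1"><img alt="Userlink" src="pic0.png"></a></li>'
        . '<li><a href="http://www.socnet1.com/username1" custattr="value1"><img alt="Socnet1" src="pic1.png"></a></li>'
        . '<li><a href="http://www.userlink2.com/my/page.html" custattr="value1"><img alt="Userlink" src="pic4.png"></a></li>';

print_r(classifyLinks($sample));
```

With this shape, the "Userlink" URLs are simply the `'Userlink'` entry of the returned array, and the other types come along without a second pattern.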
