PHP regular expression: matching a string up to a given substring

I have a URL grabber set up, and it was working fine. It grabs the URL of a document that is embedded in a response such as:

<script type='text/javascript' language='JavaScript'>
document.location.href = 'http\x3a\x2f\x2fcms.example.com\x2fd\x2fd\x2fworkspace\x2fSpacesStore\x2f61d96949-b8fb-43f1-adaf-0233368984e0\x2fFinancial\x2520Agility\x2520Report.pdf\x3fguest\x3dtrue'
</script>   

Here is my grabber script.

<?php

set_time_limit(0);
$target_url = $_POST['to'];
$html =file_get_contents($target_url);

$pattern = "/document.location.href = '([^']*)'/";
preg_match($pattern, $html, $matches, PREG_OFFSET_CAPTURE, 3);

$raw_url = $matches[1][0];
$eval_url = '$url = "'.$raw_url.'";';

eval($eval_url);
echo $url;

We had to add a parameter to our document management system, so each document URL now needs ?guest=true on the end. After this change, my grabber returned the full URL, with ?guest=true appended to the filename. So I tried to make it capture the URL only up to ?guest=true, with this code:

<?php

set_time_limit(0);

$target_url = $_POST['to'];
$html =file_get_contents($target_url);

$pattern = "/document.location.href = '([^']*)\x3fguest\x3dtrue'/";

preg_match($pattern, $html, $matches, PREG_OFFSET_CAPTURE, 3);

$raw_url = $matches[1][0];
$eval_url = '$url = "'.$raw_url.'";';

eval($eval_url);
echo $url;

Why doesn't it return the URL up to the ?guest=true part, and what is the fix?

Answers


This is the solution. The URL is returned as the whole match ($matches[0]), not in a capture group.

set_time_limit(0);

$target_url = $_POST['to'];
$html = file_get_contents($target_url);

$pattern = '/(?<=document\.location\.href = \').*?(?=\\\\x3fguest\\\\x3dtrue)/';

preg_match($pattern, $html, $matches);

$raw_url = $matches[0];
$eval_url = '$url = "'.$raw_url.'";';

eval($eval_url);
echo $url;
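
A side note on the eval() step: $raw_url comes from a remote page, so eval() will run whatever that page puts between the quotes. Here is a minimal sketch of decoding the \xNN escapes without eval(), assuming the URL only contains two-digit hex escapes like the sample in the question. decode_js_hex_escapes is a hypothetical helper name, not something from the original code, and the sample URL is shortened.

```php
<?php

// Hypothetical helper: turns JavaScript-style \xNN escapes into the
// characters they stand for, without executing the input as PHP code.
function decode_js_hex_escapes(string $s): string
{
    return preg_replace_callback(
        '/\\\\x([0-9A-Fa-f]{2})/',      // a literal backslash, "x", two hex digits
        function (array $m): string {
            return chr(hexdec($m[1]));  // e.g. "3a" becomes ":"
        },
        $s
    );
}

// Single quotes keep the backslashes literal, as they appear in the fetched page.
$raw_url = 'http\x3a\x2f\x2fcms.example.com\x2fd\x2fd\x2fworkspace';

echo decode_js_hex_escapes($raw_url), "\n";
// prints: http://cms.example.com/d/d/workspace
```

This does the same decoding that the double-quoted string inside eval() does for \xNN sequences, but it never treats the fetched text as code.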


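The problem with your regex is that the characters you wanted to match literally (. and \) were not escaped. Because the pattern is written in a double-quoted PHP string, PHP converts \x3f and \x3d to ? and = before the regex engine sees them, so the pattern no longer contains the literal text \x3fguest\x3dtrue that appears in the page. Furthermore, you do not need PREG_OFFSET_CAPTURE or the offset of 3. I guess you copied these values from a documentation example.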
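
To see why the pattern from the question never matches, it helps to look at what PHP does with the string before the regex engine runs. The following is a small sketch, assuming a shortened version of the sample page from the question; the variable names are only for illustration.

```php
<?php

// In a double-quoted string, PHP itself decodes \x3f and \x3d.
$double = "\x3fguest\x3dtrue";
// In a single-quoted string, the backslashes survive untouched.
$single = '\x3fguest\x3dtrue';

var_dump($double === '?guest=true');   // bool(true)
var_dump(strlen($single));             // int(17)

// Shortened sample page; a nowdoc keeps every backslash literal.
$html = <<<'HTML'
document.location.href = 'http\x3a\x2f\x2fcms.example.com\x2fReport.pdf\x3fguest\x3dtrue'
HTML;

// The pattern exactly as written in the question (double-quoted).
$question_pattern = "/document.location.href = '([^']*)\x3fguest\x3dtrue'/";

// By the time PCRE sees it, the tail of the pattern reads ?guest=true',
// which does not occur in the page, so nothing matches.
var_dump(preg_match($question_pattern, $html));   // int(0)
```

Writing the pattern in single quotes and doubling the backslashes, as in the solution above, is what keeps \x3f and \x3d literal.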

Here's an explanation of the regex pattern:

# (?<=document\.location\.href = ').*?(?=\\x3fguest\\x3dtrue)
# 
# Assert that the regex below can be matched, with the match ending at this position (positive lookbehind) «(?<=document\.location\.href = ')»
#    Match the characters “document” literally «document»
#    Match the character “.” literally «\.»
#    Match the characters “location” literally «location»
#    Match the character “.” literally «\.»
#    Match the characters “href = '” literally «href = '»
# Match any single character that is not a line break character «.*?»
#    Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
# Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=\\x3fguest\\x3dtrue)»
#    Match the character “\” literally «\\»
#    Match the characters “x3fguest” literally «x3fguest»
#    Match the character “\” literally «\\»
#    Match the characters “x3dtrue” literally «x3dtrue»
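
The breakdown above can be checked against a sample page. This is a sketch under a few assumptions: the page is a shortened copy of the one in the question, and the capture-group variant ($grouped) is an alternative added here for comparison, not part of the original answer.

```php
<?php

// Shortened sample page; a nowdoc keeps the \xNN sequences literal.
$html = <<<'HTML'
<script type='text/javascript' language='JavaScript'>
document.location.href = 'http\x3a\x2f\x2fcms.example.com\x2fReport.pdf\x3fguest\x3dtrue'
</script>
HTML;

// Lookaround version: the whole match ($a[0]) is the URL.
$lookaround = '/(?<=document\.location\.href = \').*?(?=\\\\x3fguest\\\\x3dtrue)/';
preg_match($lookaround, $html, $a);

// Capture-group version: the URL ends up in group 1 ($b[1]).
$grouped = '/document\.location\.href = \'(.*?)\\\\x3fguest\\\\x3dtrue/';
preg_match($grouped, $html, $b);

var_dump($a[0] === $b[1]);   // bool(true)
echo $a[0], "\n";
// prints: http\x3a\x2f\x2fcms.example.com\x2fReport.pdf
```

Both forms return the same still-escaped URL. The lookarounds only save you from indexing into a group; use whichever you find easier to read.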

This answer has been edited to reflect updates to the question.

