Determining if a URI is valid using Perl regex

For an application I'm developing I need a Perl script which loops through a massive CSV file and ensures that every single line contains a valid URI. I already asked a question earlier about parsing a CSV file and I have started using Text::CSV to make my life a lot easier. Now I have the issue of ensuring that the URI is valid.

Due to the nature of my application, URIs do not need to take the full form of

protocol://username:password@domain.extension/request?vars=values

Rather I am only interested in the request portion of this. For a general website, that would be anything after the .com, .edu, etc.

I currently have the following Perl script:

if($_ !~ /^(?:[a-z0-9-._~!$&'()*+,;=:/?@]|%[0-9A-F]{2})*$/i){
    print "Invalid URL format";
    exit;
} else {
    # stuff
}

The regex should be fairly straightforward. The request is allowed to contain either one of a small set of symbols ([a-z0-9-._~!$&'()*+,;=:/?@]) or a percent sign (%) followed by two hexadecimal digits. Either of these patterns may be repeated indefinitely.

When I run this script I get the following error:

Number found where operator expected at ./301rules.pl line 58, near "%[0"
        (Missing operator before 0?)
Bareword found where operator expected at ./301rules.pl line 58, near "9A"
        (Missing operator before A?)
Bareword found where operator expected at ./301rules.pl line 58, near "$/i"
        (Missing operator before i?)
syntax error at ./301rules.pl line 58, near "%[0"

It's fairly obvious that something in my regex needs to be escaped; however, I'm unsure what. I tried escaping every possible symbol to create the following regex:

if($_ !~ /^(?:[a-z0-9\-\.\_\~\!\$\&\'\(\)\*\+\,\;\=\:\/\?\@]|%[0-9A-F]{2})*$/i){

However, when I did this it just allowed every string to pass the test, even strings which I know are invalid, such as te%st or é.

So does anyone have experience with Perl regex and know what I need to escape and what I should not escape? With 19 different symbols I don't feel like trying all 2^19 = 524288 possibilities.

EDIT - voting to close. I found out that the issue actually existed immediately above this loop, although I don't entirely understand why yet.

I had:

if( $_ == "" ){
    next;
}
# regex conditional from above

For whatever reason it kept evaluating to true and going to the next iteration despite there clearly being data stored in $_. I'll figure out why this was, but for now the regex works fine with everything escaped.
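
For what it's worth, the likely culprit is the == operator. In Perl it compares numerically, so both sides are converted to numbers first. An empty string and most non-numeric strings (such as /some/path) both become 0, so the test is true and the loop skips the line. Below is a minimal sketch of the string comparison with eq instead; the helper name is_blank and the sample lines are made up for illustration.

```perl
use strict;
use warnings;

# is_blank is a hypothetical helper, not part of the original script.
# eq compares as strings, so only a truly empty line counts as blank.
sub is_blank {
    my ($line) = @_;
    return $line eq "";
}

for my $line ("", "/some/path", "0") {
    # == compares as numbers: "/some/path" numifies to 0, and so does "".
    # The warning is silenced here only to keep the demonstration quiet.
    no warnings 'numeric';
    printf "%-14s ==: %-5s  eq: %s\n",
        "'$line'",
        ($line == ""     ? "true" : "false"),
        (is_blank($line) ? "true" : "false");
}
```

Running the original script with use warnings enabled should also flag the numeric comparison, which would have pointed at this line sooner.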

Answers


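I don't know how you arrived at your first regex, but I'll try to help you fix it. You only have to escape the characters that have a special meaning in a regex. In yours, those are - . $ ( ) * and /. The unescaped / matters most: it is also the regex delimiter, so Perl treats it as the end of the pattern, which explains the syntax errors. The regex should look like: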

if($_ !~ /^(?:[a-z0-9\-\._~!\$&'\(\)\*+,;=:\/?@]|%[0-9A-F]{2})*$/i){

The ?: at the start of the group only makes it non-capturing: (?: ... ) groups the alternatives without saving what they match. The | sign does the "or" between your first character class and a % followed by two hex digits, and the * after the closing parenthesis repeats the whole group, so the character class does not need a multiplier of its own.
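
One way to sidestep most of the escaping is to pick a delimiter that does not occur in the pattern. The sketch below is only an illustration: the sub name is_valid_request and the sample strings are assumptions, and m{...} replaces /.../ so the / inside the class needs no backslash. Of the characters in this class, only the hyphen (placed last here) and $ and @ (escaped so they are not read as variables) still need care.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $request consists only of the allowed
# characters or %XX escapes (the same rule as the regex in the question).
sub is_valid_request {
    my ($request) = @_;
    return $request =~ m{
        ^
        (?:                                  # non-capturing group
            [a-z0-9._~!\$&'()*+,;=:/?\@-]    # one allowed character
          | %[0-9A-F]{2}                     # or a percent-encoded byte
        )*                                   # repeated any number of times
        $
    }xi ? 1 : 0;
}

# "\xE9" is the single byte for an e with acute accent in Latin-1.
for my $candidate ('/path?vars=values', '/a%20b', 'te%st', "\xE9") {
    printf "%s: %s\n", $candidate,
        is_valid_request($candidate) ? 'valid' : 'invalid';
}
```

The /x flag is there only so the pattern can be spread over several lines with comments; the rule it checks is unchanged.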


In the documentation for the URI module I found the following:

PARSING URIs WITH REGEXP

As an alternative to this module, the following (official) regular expression can be used to decode a URI:

    my($scheme, $authority, $path, $query, $fragment) =
    $uri =~ m|(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?|;

The URI::Split module provides the function uri_split() as a readable alternative.
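
As a quick illustration of how that expression could be applied to the question, the sketch below splits a made-up URL and keeps only the path and query, which is the "request" portion the asker cares about. The example.com address and the sub name split_uri are assumptions, not something taken from the URI documentation.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the regular expression quoted above.
# In list context it returns (scheme, authority, path, query, fragment);
# parts that are absent from the input come back as undef.
sub split_uri {
    my ($uri) = @_;
    return $uri =~
        m|(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?|;
}

my ($scheme, $authority, $path, $query, $fragment) =
    split_uri('http://user:pass@example.com/request?vars=values#top');

# Rebuild just the request portion (path plus optional query string).
my $request = $path . (defined $query ? "?$query" : '');
print "$request\n";    # prints /request?vars=values
```

Note that this expression only splits a string into parts. It accepts almost any input, so the request portion would still need a separate character check.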

But I think Regexp::Common::URI is probably an ideal solution for syntax validation of an HTTP URI.

use Regexp::Common qw /URI/;
while (<>) {
    /$RE{URI}{HTTP}/  and  print "Contains an HTTP URI.\n";
}

Anything written by Damian and maintained by Abigail has got to be either inspired, great, crazy, or all of the above. (And I mean that with the highest possible regard).


You should use the RFC regexp, which checks EVERY possible character.

