Improve regex to match JavaScript comments

I used the regex given in perlfaq6 to match and remove JavaScript comments, but it causes a segmentation fault when the input string is too long. The regex is:

s#/\*[^*]*\*+([^/*][^*]*\*+)*/|//([^\\]|[^\n][\n]?)*?\n|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $3 ? $3 : ""#gse;

Can it be improved to avoid the segmentation fault?

[EDIT]

Long input:

<ent r=\"6\" t=\"259\" w=\"252\" /><ent r=\"6\" t=\"257\" w=\"219\" />

repeated about 1000 times.

Answers


I suspect the trouble is partly that your 'C code' isn't very much like C code. In C, you can't have the sequence \" outside a pair of quotes, single or double, for example.

I adapted the regex to make it readable and wrapped it in a trivial script that slurps its input and applies the regex to it:

#!/usr/bin/env perl

### Original regex from PerlFAQ6.
### s#/\*[^*]*\*+([^/*][^*]*\*+)*/|//([^\\]|[^\n][\n]?)*?\n|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $3 ? $3 : ""#gse;

undef $/;  # Slurp input

while (<>)
{
    print "raw: $_";

    s%
        /\*[^*]*\*+([^/*][^*]*\*+)*/    # Simple C comments
     |  //([^\\]|[^\n][\n]?)*?\n        # C++ comments, allowing for backslash-newline continuation
     |  (
            "(\\.|[^"\\])*"             # Double-quoted strings
        |   '(\\.|[^'\\])*'             # Single-quoted characters
        |   .[^/"'\\]*                  # Anything else
        )
     %    defined $3 ? $3 : ""
     %egsx;

    print "out: $_";
}
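To see what each branch of the pattern does on ordinary C-like input, here is a minimal sketch that wraps the same substitution in a helper. The name strip_comments and the sample string are made up for illustration; they are not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the perlfaq6 substitution to a string
# and returns the result instead of editing $_ in place.
sub strip_comments {
    my ($src) = @_;
    $src =~ s%
        /\*[^*]*\*+([^/*][^*]*\*+)*/    # Simple C comments
     |  //([^\\]|[^\n][\n]?)*?\n        # C++ comments
     |  (
            "(\\.|[^"\\])*"             # Double-quoted strings
        |   '(\\.|[^'\\])*'             # Single-quoted characters
        |   .[^/"'\\]*                  # Anything else
        )
     %    defined $3 ? $3 : ""
     %egsx;
    return $src;
}

# Made-up sample: one block comment, one string literal that merely
# looks like a comment, and one // comment.
my $sample = qq{int x = 1; /* one */ char *s = "/* not a comment */"; // trailing\nint y;\n};

# The string literal survives because the third group captures it;
# the // comment is removed together with the newline that ends it.
print strip_comments($sample);
```

Only text matched by the third group is kept, so anything matched by the first two branches is replaced by the empty string.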

I took your line of non-C code, and created files data.1, data.2, data.4, data.8, ..., data.1024 with the appropriate number of lines in each. I then ran a timing loop.

$ for x in 1 2 4 8 16 32 64 128 256 512 1024
> do
>     echo
>     echo $x
>     time perl xx.pl data.$x > /dev/null
> done
$
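For anyone wanting to reproduce the run, the data.N files read by the loop above could be generated with something like the following. This is a sketch only, not necessarily how the files were created here; the helper name write_data_file is hypothetical, and the line is the one quoted in the question.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# The line quoted in the question. Single quotes keep the backslashes literal.
my $line = '<ent r=\"6\" t=\"259\" w=\"252\" /><ent r=\"6\" t=\"257\" w=\"219\" />';

# Hypothetical helper: write $count copies of $line to data.$count in $dir
# and return the path of the file that was written.
sub write_data_file {
    my ($dir, $count) = @_;
    my $path = "$dir/data.$count";
    open my $fh, '>', $path or die "Cannot write $path: $!";
    print {$fh} "$line\n" x $count;
    close $fh or die "Cannot close $path: $!";
    return $path;
}

# Powers of two from 1 to 1024, matching the file names in the timing loop.
write_data_file('.', 2**$_) for 0 .. 10;
```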

I've munged the output to give just the real time for the different file sizes:

   1    0m0.022s
   2    0m0.005s
   4    0m0.007s
   8    0m0.013s
  16    0m0.035s
  32    0m0.130s
  64    0m0.523s
 128    0m2.035s
 256    0m6.756s
 512    0m28.062s
1024    1m36.134s
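
The growth pattern in these timings can be checked with a little arithmetic. The sketch below takes the five largest measurements from the table, converted to seconds, and prints the ratio between consecutive runs; the helper name doubling_ratios is made up for this illustration.

```perl
use strict;
use warnings;

# [lines, real time in seconds], copied from the table above (64 to 1024 lines).
my @timings = (
    [64,   0.523],
    [128,  2.035],
    [256,  6.756],
    [512,  28.062],
    [1024, 96.134],    # 1m36.134s
);

# Hypothetical helper: ratio of each timing to the one before it.
sub doubling_ratios {
    my @t = @_;
    return map { $t[$_][1] / $t[$_ - 1][1] } 1 .. $#t;
}

my @ratios = doubling_ratios(@timings);
for my $i (0 .. $#ratios) {
    printf "%4d -> %4d lines: %.2fx\n",
        $timings[$i][0], $timings[$i + 1][0], $ratios[$i];
}
```

Each doubling of the input multiplies the run time by somewhere between about 3.3 and 4.2, which is consistent with roughly quadratic growth on this particular input.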

I did not get a core dump (Perl 5.16.0 on Mac OS X 10.7.4; 8 GiB main memory), but the run does start to take a significant amount of time as the input grows. The process was not growing while it ran; during the 1024-line run, it was using about 13 MiB of 'real' memory and 23 MiB of 'virtual' memory.

I tried Perl 5.10.0 (the oldest version I have compiled on my machine), and it used slightly less 'real' memory, essentially the same 'virtual' memory, and was noticeably slower (33.3s for 512 lines; 1m 53.9s for 1024 lines).

Just for comparison purposes, I collected some C code that I had lying around in the test directory to create a file of about 88 KiB, with 3100 lines of which about 200 were comment lines. This compares with the size of the data.1024 file which was about 77 KiB. Processing that took between 10 and 20 milliseconds.

Summary

The non-C source you have makes a very nasty test case. Perl shouldn't crash on it.

Which version of Perl are you using, and on which platform? How much memory does your machine have? That said, the total amount of memory is unlikely to be the issue (24 MiB is not a problem on most machines that run Perl). If you have a very old version of Perl, the results might be different.


I also note that the regex does not handle some pathological C comments that a C compiler must handle, such as:

/\
\
* Yes, this is a comment *\
\
/
/\
\
/ And so is this

Yes, you'd be right to reject any code submitted for review that contained such comments.
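
If such input has to be handled anyway, one possible approach (an assumption on my part, not something the perlfaq6 regex does) is to delete every backslash-newline pair before stripping comments, which is what a C compiler does in an early translation phase. A minimal sketch, with a hypothetical helper named splice_lines:

```perl
use strict;
use warnings;

# Hypothetical helper: join continuation lines by deleting each backslash
# that is immediately followed by a newline.
sub splice_lines {
    my ($src) = @_;
    $src =~ s/\\\n//g;
    return $src;
}

# The two pathological comments shown above, written as Perl strings.
my $block_comment = "/\\\n\\\n* Yes, this is a comment *\\\n\\\n/\n";
my $line_comment  = "/\\\n\\\n/ And so is this\n";

print splice_lines($block_comment);   # prints: /* Yes, this is a comment */
print splice_lines($line_comment);    # prints: // And so is this
```

After splicing, both comments have the ordinary form that the regex recognises. The price is that line numbers in the output no longer match the original file, and the step only makes sense for C or C++ input.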

