How to trigger Perl multiline substitution

I have a folder of HTML files that all start with the DOCTYPE declaration below. I need to remove it so that a not-very-good parser can successfully load the files as XML.

I've been trying to use perl to do the substitution in place, but no change is made when I run the substitution and I can't figure out why. Can anyone identify the correct flags or specification I need in order to remove the DOCTYPE declaration here?

Here's an example file I'd like to manipulate.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta name="generator" content=
  "HTML Tidy for Linux/x86 (vers 25 March 2009), see www.w3.org" />
  <title></title>
</head>
  <body>
  </body>
</html>

Here's the perl one-liner I'm trying to use, which looks for the angle brackets, the exclamation mark, and everything before the close angle bracket. It incorporates perl substitution flags which other postings suggest should work for a multiline match - m for multiline, s for allowing newlines to be matched by regex. I'm then replacing the match with the empty string.

perl -i -e 's/<![^>]+>//gsm' `find . -name '*.html'`

I can't figure out why, but the DOCTYPE is not removed from the file after running this command. Does anyone else know why?

Answers


What you need is the -0777 switch which will cause the entire file to be read into a single string. If this is not used, the files will be read in line-by-line mode, and you can never match a multi-line statement that way.

Also, as Andomar points out, you are missing the -p switch, but I assume you figured that out.

The modifiers on the regex won't matter in this case, except the /g modifier. /m only changes where ^ and $ match, and /s lets the wildcard . match newlines. Your regex uses none of ^, $ or ., and the negated class [^>] already matches newlines, so neither modifier has any effect.
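To check the point about the modifiers, here is a small sketch that runs the same substitution with and without /s and /m on a made-up two-line declaration (the sample string and variable names are mine, not from the question). Both runs should strip the declaration.

```sh
# A two-line declaration held in one string (made-up sample input).
input='<!DOCTYPE a
b><html/>'

# No /s or /m: [^>] is a negated character class, so it matches "\n" anyway.
plain=$(printf '%s' "$input" | perl -0777 -pe 's/<![^>]+>//g')

# The same substitution with /s and /m added.
flagged=$(printf '%s' "$input" | perl -0777 -pe 's/<![^>]+>//gsm')

printf 'without flags: %s\nwith flags:    %s\n' "$plain" "$flagged"
```

The -0777 switch is what makes the difference here, since it puts both lines into $_ at once.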

So basically, you want something like:

perl -0777 -pi -e 's/<![^>]+>//g' ...
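If you want to try this before touching real files, something like the sketch below may help. It assumes a scratch directory from mktemp and a cut-down sample file of my own, uses -i.bak so the original is kept as a backup, and hands the file names to perl via find -exec rather than backticks so that names containing spaces are passed through intact.

```sh
# Work in a scratch directory so no real files are touched.
tmpdir=$(mktemp -d)

# A cut-down version of the example file from the question.
printf '%s\n' \
  '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"' \
  '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">' \
  '' \
  '<html xmlns="http://www.w3.org/1999/xhtml">' \
  '</html>' > "$tmpdir/sample.html"

# -0777 slurps each file whole, -p prints the result, and -i.bak edits in
# place while keeping the original as sample.html.bak.
find "$tmpdir" -name '*.html' -exec perl -0777 -pi.bak -e 's/<![^>]+>//g' {} +

# Show the edited file; the declaration should be gone.
cat "$tmpdir/sample.html"
```

Once the result looks right, the .bak files can be deleted, or the suffix dropped from -i.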

Side note:

HTML should be handled with parsers, ideally, so I spent a few minutes working on using HTML::Parser, which has a convenient option to strip declarations by adding a handler. Something like this seems to print ok for a single file:

perl -MHTML::Parser -we '
    $p = HTML::Parser->new(default_h => [sub {print @_},"text"] );
    $p->handler(declaration => "");
    $p->parse_file(shift) or die $!;' yourfile.html

I figured it would be overkill so I abandoned trying to fix it with the -pi in-place edit switches, but it is (probably) easily implemented in a script.


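First, you seem to be missing the -p switch, which wraps your code in a loop that reads each line of input and prints it back after the code runs. Without -p (or a read-and-print loop of your own), the substitution never sees the file contents and -i has nothing to write.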

Second, since -pi processes the input line-by-line, it can't replace a regex that spans more than one line.

You could write a Perl script instead. This script should run your regex on the entire content of all files passed on the command line:

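use IO::All;

# Process every file named on the command line.
foreach my $file (@ARGV) {
    # Read the whole file into one string so the regex can span lines.
    my $content = io($file)->slurp;
    $content =~ s/<![^>]+>//g;
    # IO::All overloads ">" to write the string back to the file.
    $content > io($file);
}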

The command cpan IO::All should install the IO::All module, if it is not present on your system.

P.S. The m and s options only affect ., ^ and $. I think you can omit them.

