How can I prevent XML::XPath from fetching a DTD while processing an XML file?
My XML file (a.xhtml) starts like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> ...
My code starts like this:

    use XML::XPath;
    use XML::XPath::XMLParser;

    my $xp = XML::XPath->new(filename => "a.xhtml");
    my $nodeset = $xp->find('/html/body//table');
It's very slow, and it turns out that it spends most of its time fetching the DTD (http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd).
Is there a way to explicitly declare an HTTP proxy server for the Perl XML:: family of modules? I would rather not modify the original a.xhtml document, for example by pointing it at a local copy of the DTD.
XML::XPath is based on XML::Parser. XML::Parser has an option to NOT use LWP to resolve external entities (such as DTDs), and XML::XPath lets you pass in an XML::Parser object to use as the parser.
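So you can write this:

    use XML::Parser;
    use XML::XPath;

    # NoLWP => 1 stops XML::Parser from using LWP to fetch external entities
    my $p  = XML::Parser->new(NoLWP => 1);
    my $xp = XML::XPath->new(parser => $p, filename => "a.xhtml");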
Note that in this case you will lose all entities except numeric ones and the predefined ones (&gt;, &lt;, &amp;, &apos; and &quot;). The parser will not complain; the other entities will silently disappear (for example, try including &alpha; in the table and printing it).
As a matter of fact, you probably should not use XML::XPath at all, as it is not actively maintained.

Try XML::LibXML if you have no problem installing libxml2. Its interface is very similar to that of XML::XPath, as they both implement the DOM. XML::LibXML is also much more powerful than XML::XPath, and faster to boot.

If you want an expat/XML::Parser-based module, then you might want to have a look at XML::Twig (that's blatant self-promotion, as I am the author of the module, sorry). Also, for HTML or dodgy XHTML, you can use HTML::TreeBuilder, which supports XPath with the addition of HTML::TreeBuilder::XPath (also by me).
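For the XML::LibXML route, here is a minimal sketch of the same table query with DTD loading turned off. It is only an illustration under a few assumptions: the inline `$xhtml` string stands in for a.xhtml, the document is assumed to declare the XHTML namespace (hence the `x` prefix, which is an arbitrary name chosen here), and `load_ext_dtd` and `no_network` are XML::LibXML parser options as I understand them from its documentation.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Inline stand-in for a.xhtml so the sketch is self-contained (assumed content).
my $xhtml = <<'XHTML';
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>demo</title></head>
  <body>
    <div><table><tr><td>cell</td></tr></table></div>
  </body>
</html>
XHTML

# load_ext_dtd => 0 tells the parser not to load the external DTD;
# no_network => 1 additionally forbids network access while parsing.
my $doc = XML::LibXML->load_xml(
    string       => $xhtml,
    load_ext_dtd => 0,
    no_network   => 1,
);

# The elements live in the XHTML namespace, so bind a prefix for XPath.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs( x => 'http://www.w3.org/1999/xhtml' );

my @tables = $xpc->findnodes('/x:html/x:body//x:table');
printf "found %d table(s)\n", scalar @tables;
```

To read the real file, replace `string => $xhtml` with `location => "a.xhtml"`. If the document does not declare the XHTML namespace, the unprefixed expression '/html/body//table' should work directly with `$doc->findnodes`.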