What is the fastest way to parse large XML docs in Python?

I am currently the following code based on Chapter 12.5 of the Python Cookbook:

from xml.parsers import expat

class Element(object):
    def __init__(self, name, attributes):
        self.name = name
        self.attributes = attributes
        self.cdata = ''
        self.children = []
    def addChild(self, element):
        self.children.append(element)
    def getAttribute(self,key):
        return self.attributes.get(key)
    def getData(self):
        return self.cdata
    def getElements(self, name=''):
        if name:
            return [c for c in self.children if c.name == name]
        else:
            return list(self.children)

class Xml2Obj(object):
    def __init__(self):
        self.root = None
        self.nodeStack = []
    def StartElement(self, name, attributes):
        element = Element(name.encode(), attributes)
        if self.nodeStack:
            parent = self.nodeStack[-1]
            parent.addChild(element)
        else:
            self.root = element
        self.nodeStack.append(element)
    def EndElement(self, name):
        self.nodeStack.pop()
    def CharacterData(self,data):
        if data.strip():
            data = data.encode()
            element = self.nodeStack[-1]
            element.cdata += data
    def Parse(self, filename):
        Parser = expat.ParserCreate()
        Parser.StartElementHandler = self.StartElement
        Parser.EndElementHandler = self.EndElement
        Parser.CharacterDataHandler = self.CharacterData
        ParserStatus = Parser.Parse(open(filename).read(),1)
        return self.root

I am working with XML docs about 1 GB in size. Does anyone know a faster way to parse these?

Answers


I looks to me as if you do not need any DOM capabilities from your program. I would second the use of the (c)ElementTree library. If you use the iterparse function of the cElementTree module, you can work your way through the xml and deal with the events as they occur.

Note however, Fredriks advice on using cElementTree iterparse function:

to parse large files, you can get rid of elements as soon as you’ve processed them:

for event, elem in iterparse(source):
    if elem.tag == "record":
        ... process record elements ...
        elem.clear()

The above pattern has one drawback; it does not clear the root element, so you will end up with a single element with lots of empty child elements. If your files are huge, rather than just large, this might be a problem. To work around this, you need to get your hands on the root element. The easiest way to do this is to enable start events, and save a reference to the first element in a variable:

# get an iterable
context = iterparse(source, events=("start", "end"))

# turn it into an iterator
context = iter(context)

# get the root element
event, root = context.next()

for event, elem in context:
    if event == "end" and elem.tag == "record":
        ... process record elements ...
        root.clear()

The lxml.iterparse() does not allow this.


Need Your Help

Should a legacy Android application be rebuilt using SDK 2.1?

android legacy

I have an Android application that uses the well known Strategies for Legacy Applications. It is build with the Android SDK 2.0 with manifest settings minSdkVersion="3" (API 1.5) and

matlab R2011a unable to compile mex files

matlab gcc mex

I have Xcode(4.4.1) installed on my mac, and use MATLAB r2011a (64 bit) on the computer. I have ben trying to compile mex files, but I get the following error:

About UNIX Resources Network

Original, collect and organize Developers related documents, information and materials, contains jQuery, Html, CSS, MySQL, .NET, ASP.NET, SQL, objective-c, iPhone, Ruby on Rails, C, SQL Server, Ruby, Arrays, Regex, ASP.NET MVC, WPF, XML, Ajax, DataBase, and so on.