Python XML file format parsing

XML (Extensible Markup Language) is a subset of the Standard Generalized Markup Language (SGML). It is a markup language used to give electronic documents structure, and it is designed to transport and store data.

Python has three common ways of parsing XML: SAX (Simple API for XML), DOM (Document Object Model), and ElementTree.

DOM approach: DOM (Document Object Model) is a standard programming interface recommended by the W3C. It parses XML data into a tree in memory and manipulates the XML by manipulating the tree.
SAX approach: SAX is an event-driven model for handling XML. It scans a document sequentially, parsing as it goes, which is a big advantage for large documents.
ElementTree approach: ElementTree performs better than DOM, about as well as SAX, and has an easy-to-use API.

Beyond the built-in XML parsers, Python has many other parsing tools. Which one is easiest to use and performs best? Let's explore.

xml.dom.* module

xml.dom implements the DOM API developed by the W3C. It is a good fit if you are already used to the DOM API. xml.dom parses XML data into a tree in memory and manipulates the XML by manipulating the tree. A DOM parser reads the whole document at once and keeps all of its elements in memory as a tree. You can then use the functions provided by the DOM to read or modify the content and structure of the document, or write the modified content back to the XML file.
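
The read, modify and write-back cycle described above can be sketched with minidom. This is a minimal illustration, not a complete recipe: the inline document, the "id" attribute and the "dept" element are made up for the example.

```python
from xml.dom import minidom

# Hypothetical inline sample so the sketch is self-contained.
XML_TEXT = "<employees><employee><name>linux</name><age>30</age></employee></employees>"

doc = minidom.parseString(XML_TEXT)
root = doc.documentElement

# Read: the text of an element lives in a child text node.
name_node = root.getElementsByTagName("name")[0]
original_name = name_node.firstChild.data

# Modify: change the text and set an attribute on the parent element.
name_node.firstChild.data = "unix"
employee = root.getElementsByTagName("employee")[0]
employee.setAttribute("id", "1")

# Extend the structure: create a new element with a text child.
dept = doc.createElement("dept")
dept.appendChild(doc.createTextNode("ops"))
employee.appendChild(dept)

# Serialize the modified tree back to XML text.
modified_xml = root.toxml()
print(modified_xml)
```

To persist the result, the serialized text could be written to a file opened in text mode.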

Note: The xml.dom package contains several modules; mind the differences between them.

minidom is an extremely simplified implementation of the DOM API, much simpler than the full DOM, and the package is also much smaller.
The pulldom module provides a "pull parser": instead of receiving callbacks, your code pulls events from the XML stream and processes them one at a time.

# <?xml version="1.0" encoding="UTF-8"?>
#  <employees>
#   <employee>
#     <name>linux</name>
#     <age>30</age>
#   </employee>
#   <employee>
#     <name>windows</name>
#     <age>20</age>
#   </employee>
#  </employees>

from xml.dom import minidom
doc = minidom.parse("employees.xml")
root = doc.documentElement
employees = root.getElementsByTagName("employee")
for employee in employees:
    print (employee.nodeName)
    print (employee.toxml())
    nameNode = employee.getElementsByTagName("name")[0]
    print (nameNode.childNodes)
    print (nameNode.nodeName + ":" + nameNode.childNodes[0].nodeValue)
    ageNode = employee.getElementsByTagName("age")[0]
    print (ageNode.childNodes)
    print (ageNode.nodeName + ":" + ageNode.childNodes[0].nodeValue)
    for n in employee.childNodes:
        print (n)
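
The pulldom module mentioned earlier can be sketched in the same spirit. The example below is an assumption-laden illustration: it uses an inline copy of the employees data and only expands the elements it cares about into small DOM subtrees.

```python
from xml.dom import pulldom

# Inline copy of the employees sample so the sketch is self-contained.
XML_TEXT = """<employees>
  <employee><name>linux</name><age>30</age></employee>
  <employee><name>windows</name><age>20</age></employee>
</employees>"""

names = []
events = pulldom.parseString(XML_TEXT)
for event, node in events:
    # React only to the start of each <employee> element.
    if event == pulldom.START_ELEMENT and node.tagName == "employee":
        # Pull the rest of this element into a small DOM subtree.
        events.expandNode(node)
        name_node = node.getElementsByTagName("name")[0]
        names.append(name_node.firstChild.data)

print(names)
```

Only the expanded elements are built as DOM nodes, so the rest of the stream is consumed without being kept in memory.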

xml.sax.* module

The xml.sax.* module is an implementation of the SAX API. It sacrifices convenience for speed and a smaller memory footprint. SAX (Simple API for XML) is based on event handling: the XML document is read sequentially, and each time an element is encountered the corresponding event handler function is triggered to process it.

SAX features:

It is an event-based API.
It operates at a lower level than the DOM.
It gives you more control than the DOM.
It is almost always more efficient than the DOM.
Unfortunately, it requires more work than the DOM.

Parsing XML with SAX in Python requires importing xml.sax and xml.sax.handler.

# <?xml version="1.0"?>
# <collection shelf="New Arrivals">
#     <movie title="Enemy Behind">
#        <type>War, Thriller</type>
#        <format>DVD</format>
#        <year>2003</year>
#        <rating>PG</rating>
#        <stars>10</stars>
#        <description>Talk about a US-Japan war</description>
#     </movie>
#     <movie title="Transformers">
#        <type>Anime, Science Fiction</type>
#        <format>DVD</format>
#        <year>1989</year>
#        <rating>R</rating>
#        <stars>8</stars>
#        <description>A scientific fiction</description>
#     </movie>
#     <movie title="Trigun">
#        <type>Anime, Action</type>
#        <format>DVD</format>
#        <episodes>4</episodes>
#        <rating>PG</rating>
#        <stars>10</stars>
#        <description>Vash the Stampede!</description>
#     </movie>
#     <movie title="Ishtar">
#        <type>Comedy</type>
#        <format>VHS</format>
#        <rating>PG</rating>
#        <stars>2</stars>
#        <description>Viewable boredom</description>
#     </movie>
# </collection>

import xml.sax
import xml.sax.handler


class MovieHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.CurrentData = ""
        self.type = ""
        self.format = ""
        self.year = ""
        self.rating = ""
        self.stars = ""
        self.description = ""

    # Handle the start of an element
    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "movie":
            print("*****Movie*****")
            title = attributes["title"]
            print("Title:", title)

    # Handle the end of an element
    def endElement(self, tag):
        if self.CurrentData == "type":
            print("Type:", self.type)
        elif self.CurrentData == "format":
            print("Format:", self.format)
        elif self.CurrentData == "year":
            print("Year:", self.year)
        elif self.CurrentData == "rating":
            print("Rating:", self.rating)
        elif self.CurrentData == "stars":
            print("Stars:", self.stars)
        elif self.CurrentData == "description":
            print("Description:", self.description)
        self.CurrentData = ""

    # Handle character data inside an element
    def characters(self, content):
        if self.CurrentData == "type":
            self.type = content
        elif self.CurrentData == "format":
            self.format = content
        elif self.CurrentData == "year":
            self.year = content
        elif self.CurrentData == "rating":
            self.rating = content
        elif self.CurrentData == "stars":
            self.stars = content
        elif self.CurrentData == "description":
            self.description = content


if __name__ == "__main__":
    # Create an XMLReader
    parser = xml.sax.make_parser()
    # Turn off namespaces
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    # Install our ContentHandler in place of the default one
    handler = MovieHandler()
    parser.setContentHandler(handler)
    parser.parse("movies.xml")
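
A SAX parser is allowed to deliver the text of one element in several characters() calls, so a handler that needs the full text usually buffers the chunks until the end tag. The sketch below is one possible variation on the handler above, not the original author's code: it collects each movie into a dictionary instead of printing it. The class name MovieCollector and the shortened inline document are assumptions made for the example.

```python
import xml.sax

# Shortened, hypothetical version of the movies sample.
XML_TEXT = b"""<collection>
  <movie title="Enemy Behind"><year>2003</year><rating>PG</rating></movie>
  <movie title="Ishtar"><rating>PG</rating></movie>
</collection>"""


class MovieCollector(xml.sax.ContentHandler):
    """Collect each <movie> into a dict instead of printing it."""

    def __init__(self):
        super().__init__()
        self.movies = []       # finished movie records
        self._current = None   # record currently being built
        self._buffer = []      # text chunks of the current element

    def startElement(self, tag, attributes):
        self._buffer = []
        if tag == "movie":
            self._current = {"title": attributes["title"]}

    def characters(self, content):
        # May be called more than once for a single run of text.
        self._buffer.append(content)

    def endElement(self, tag):
        if tag == "movie":
            self.movies.append(self._current)
            self._current = None
        elif self._current is not None:
            # Join the buffered chunks into the element's full text.
            self._current[tag] = "".join(self._buffer).strip()
        self._buffer = []


handler = MovieCollector()
xml.sax.parseString(XML_TEXT, handler)
print(handler.movies)
```

Returning data from the handler rather than printing inside it keeps the parsing step separate from whatever is done with the results.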

xml.parsers.expat

xml.parsers.expat provides a direct, low-level interface to the expat parser, which is written in C. The expat interface is similar to SAX in that it is also based on an event callback mechanism, but it is not standardized and applies only to the expat library.

import xml.parsers.expat

class ExParser(object):
    '''Parse roster xml'''
    def __init__(self, xml_raw):
        '''init parser and setup handlers'''
        self.parser = xml.parsers.expat.ParserCreate()

        #connect handlers
        self.parser.StartElementHandler = self.start_element
        self.parser.EndElementHandler = self.end_element
        self.parser.CharacterDataHandler = self.char_data
        self.parser.Parse(xml_raw)
        del(xml_raw)

    def start_element(self, name, attrs):
        '''Start xml element handler'''
        print('start:'+name)

    def end_element(self, name):
        '''End xml element handler'''
        print('end:'+name)

    def char_data(self, data):
        '''Char xml element handler'''
        print('data is '+data)
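
For a quick way to see the callbacks fire, here is a smaller sketch that uses plain functions instead of a class and records the events in a list. The roster snippet and the event tuples are invented for illustration; buffer_text is switched on so that contiguous character data arrives in a single callback.

```python
import xml.parsers.expat

events = []


def start_element(name, attrs):
    events.append(("start", name, dict(attrs)))


def end_element(name):
    events.append(("end", name))


def char_data(data):
    # Skip whitespace-only text between elements.
    if data.strip():
        events.append(("data", data))


parser = xml.parsers.expat.ParserCreate()
parser.buffer_text = True  # deliver contiguous text in one callback
parser.StartElementHandler = start_element
parser.EndElementHandler = end_element
parser.CharacterDataHandler = char_data

# The second argument marks this as the final chunk of input.
parser.Parse('<roster><item id="1">alice</item></roster>', True)
print(events)
```

The ExParser class above works the same way: constructing it with a string of XML triggers the three handlers as the parser walks the input.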

ElementTree

The xml.etree.ElementTree module (ET for short) provides a lightweight, Pythonic API backed by an efficient C implementation. ET is faster than the DOM, and its API is more straightforward and easier to use. Like SAX, it can also parse on demand: the ET.iterparse function processes a document incrementally instead of reading it all into memory at once. ET's performance is roughly similar to that of the SAX module, but its higher-level API is easier to use.

Historically the Python standard library shipped two implementations of ElementTree: the pure-Python xml.etree.ElementTree and the faster C version, xml.etree.cElementTree. Since Python 3.3, xml.etree.ElementTree uses the C accelerator automatically whenever it is available, and xml.etree.cElementTree was deprecated and then removed in Python 3.9. On older interpreters, prefer the C implementation because it is faster and consumes less memory.

# <?xml version="1.0"?>
# <doc>
#     <branch name="testing" hash="1cdf045c">
#         text,source
#     </branch>
#     <branch name="release01" hash="f200013e">
#         <sub-branch name="subrelease01">
#             xml,sgml
#         </sub-branch>
#     </branch>
#     <branch name="invalid">
#     </branch>
# </doc>

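# On Python 3.3+ this module uses the C accelerator automatically when available;
# the separate xml.etree.cElementTree module was removed in Python 3.9.
import xml.etree.ElementTree as ET

tree = ET.ElementTree(file='doc1.xml')
root = tree.getroot()
print(root.tag, root.attrib)
# Direct children of the root
for child_of_root in root:
    print(child_of_root.tag, child_of_root.attrib)
# Every element in the tree
for elem in tree.iter():
    print(elem.tag, elem.attrib)
# Only elements with the tag "branch"
for elem in tree.iter(tag='branch'):
    print(elem.tag, elem.attrib)
# Elements matching a path
for elem in tree.iterfind('branch/sub-branch'):
    print(elem.tag, elem.attrib)
# Elements matching a path with an attribute predicate
for elem in tree.iterfind('branch[@name="release01"]'):
    print(elem.tag, elem.attrib)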

Element object

class xml.etree.ElementTree.Element(tag, attrib={}, **extra)

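    tag: string, the kind of data the element represents.
    text: string, the content of the element.
    tail: string, the text that follows the element's end tag.
    attrib: dictionary, the element's attributes.

    # Operations on attributes
    clear(): Resets the element: removes all descendants and attributes, and sets text and tail to None.
    get(key, default=None): Returns the value of the attribute named key, or default if the attribute does not exist.
    items(): Returns the element's attributes as a sequence of (key, value) pairs.
    keys(): Returns the names of the element's attributes.
    set(key, value): Sets an attribute to the given value.

    # Operations on descendants
    append(subelement): Adds a direct child element.
    extend(subelements): Appends a sequence of element objects as children. # New in Python 2.7
    find(match): Finds the first matching child element; match can be a tag name or a path.
    findall(match): Finds all matching child elements; match can be a tag name or a path.
    findtext(match): Finds the first matching child element and returns its text; match can be a tag name or a path.
    insert(index, element): Inserts a child element at the given position.
    iter(tag=None): Creates an iterator over the current element and all its descendants, or only those with the given tag. # New in Python 2.7
    iterfind(match): Finds all matching subelements by tag name or path and returns them as an iterator.
    itertext(): Iterates over the element and all its descendants, yielding their text.
    remove(subelement): Removes a child element.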
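
A few of these Element methods in action, as a rough sketch. The inline document is a trimmed, hypothetical variant of the doc sample above, and the "reviewed" and "owner" attribute names are made up.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<doc><branch name="testing" hash="1cdf045c">text,source</branch>'
    '<branch name="invalid"/></doc>'
)

# Attribute access: get() with and without a default, then set().
first = root.find("branch")
first_name = first.get("name")
missing = first.get("owner", "nobody")
first.set("reviewed", "yes")
attr_keys = sorted(first.keys())

# Searching children by tag name.
branch_names = [b.get("name") for b in root.findall("branch")]
first_text = root.findtext("branch")

# Changing the structure: append one child, remove another.
root.append(ET.Element("branch", {"name": "release01"}))
root.remove(root.find('branch[@name="invalid"]'))
remaining = [b.get("name") for b in root.iter("branch")]

print(first_name, missing, attr_keys, branch_names, first_text, remaining)
```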

ElementTree object

class xml.etree.ElementTree.ElementTree(element=None, file=None)
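    element, if given, becomes the root node of the new ElementTree.

    _setroot(element): Replaces the current root node with the given element. Use with care.

    # The following methods are similar to the methods of the same name on the Element class, except that they operate on the root node.
    find(match)
    findall(match)
    findtext(match, default=None)
    getroot(): Returns the root node.
    iter(tag=None)
    iterfind(match)
    parse(source, parser=None): Loads XML into the tree; source can be a file name or a file object.
    write(file, encoding="us-ascii", xml_declaration=None, default_namespace=None, method="xml")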
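
As a small sketch of the parse() and write() round trip, the example below reads from and writes to in-memory buffers so that it runs without any files on disk; with real files you would pass file names instead. The document content is invented for the example.

```python
import io
import xml.etree.ElementTree as ET

source = io.StringIO('<doc><branch name="testing">text</branch></doc>')

tree = ET.ElementTree()
root = tree.parse(source)  # parse() returns the root element
root.find("branch").set("name", "stable")

# Write the modified tree, including an XML declaration.
buffer = io.BytesIO()
tree.write(buffer, encoding="utf-8", xml_declaration=True)
output = buffer.getvalue().decode("utf-8")
print(output)
```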

Module functions

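xml.etree.ElementTree.Comment(text=None)
Creates a special element that the standard serializer writes out as an XML comment. The comment text can be a bytestring or a Unicode string.

xml.etree.ElementTree.dump(elem)
Writes an element tree or a single element to sys.stdout. This function is best used only for debugging.

xml.etree.ElementTree.fromstring(text)
Parses a string containing XML data. Like XML(), it returns an Element instance.

xml.etree.ElementTree.fromstringlist(sequence, parser=None)
Parses an XML document from a sequence of string fragments. The default parser is XMLParser. Returns an Element instance.

xml.etree.ElementTree.iselement(element)
Checks whether an object is an element object.

xml.etree.ElementTree.iterparse(source, events=None, parser=None)
Incrementally parses a file name or a file object containing XML data into an element tree and reports progress as it goes. events is the list of events to report; if omitted, only end events are reported.
Note that iterparse() emits the start event as soon as it sees the ">" of an opening tag. At that point the attributes are defined, but the text and tail attributes are not, and neither are the child elements, so they may be missing. If you need the complete element, wait for the end event.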
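
The difference between start and end events can be seen in a short sketch like the one below. The inline document is a simplified, hypothetical version of the doc sample, and calling clear() on finished elements is one common way to keep memory use low on large inputs.

```python
import io
import xml.etree.ElementTree as ET

XML_TEXT = b"""<doc>
  <branch name="testing">text,source</branch>
  <branch name="release01">xml,sgml</branch>
</doc>"""

seen = []
for event, elem in ET.iterparse(io.BytesIO(XML_TEXT), events=("start", "end")):
    if event == "start":
        # Attributes are available here; text and children may not be yet.
        seen.append(("start", elem.tag))
    elif elem.tag == "branch":
        # At the end event the element is complete.
        seen.append(("end", elem.get("name"), elem.text))
        elem.clear()  # drop the processed element's content to free memory

print(seen)
```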

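xml.etree.ElementTree.parse(source, parser=None)
Parses a file into an element tree. source can be a file name or a file object.

xml.etree.ElementTree.ProcessingInstruction(target, text=None)
Creates a special element that is serialized as an XML processing instruction.

xml.etree.ElementTree.register_namespace(prefix, uri)
Registers a namespace prefix. The registry is global, and any existing mapping for either the given prefix or the namespace URI is removed.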
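
A minimal sketch of what register_namespace() changes in serialized output. The prefix "ex" and the namespace URI are placeholders chosen for this example.

```python
import xml.etree.ElementTree as ET

NS = "http://example.com/ns"  # hypothetical namespace URI
ET.register_namespace("ex", NS)

# Qualified names are written as {uri}tag.
root = ET.Element("{%s}root" % NS)
ET.SubElement(root, "{%s}item" % NS).text = "value"

serialized = ET.tostring(root, encoding="unicode")
print(serialized)
```

Without the registration, the serializer would fall back to a generated prefix for the same namespace.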

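xml.etree.ElementTree.SubElement(parent, tag, attrib={}, **extra)
Subelement factory. Creates an Element instance and appends it to the given parent.

xml.etree.ElementTree.tostring(element, encoding="us-ascii", method="xml")
Generates a string representation of an XML element, including all subelements. element is an Element instance; method is "xml", "html" or "text". Returns an encoded string containing the XML data.

xml.etree.ElementTree.tostringlist(element, encoding="us-ascii", method="xml")
Generates a string representation of an XML element, including all subelements. element is an Element instance; method is "xml", "html" or "text". Returns a list of encoded strings containing the XML data.

xml.etree.ElementTree.XML(text, parser=None)
Parses an XML fragment from a string constant. Returns an Element instance.

xml.etree.ElementTree.XMLID(text, parser=None)
Parses an XML fragment from a string constant and also returns a dictionary that maps element ids to the elements themselves.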
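
Several of these module-level functions combine naturally when building a document from scratch. The sketch below uses made-up element names and values; it builds a tree, serializes it, parses it back, and looks elements up by id.

```python
import xml.etree.ElementTree as ET

# Build a small tree with a comment and one child element.
root = ET.Element("doc")
root.append(ET.Comment("generated example"))
branch = ET.SubElement(root, "branch", {"name": "testing"}, hash="1cdf045c")
branch.text = "text,source"

# Serialize to text, then parse the text back into an element.
xml_text = ET.tostring(root, encoding="unicode")
reparsed = ET.fromstring(xml_text)

# XMLID returns the root element plus an id-to-element dictionary.
tree, ids = ET.XMLID('<doc><branch id="b1">x</branch></doc>')

print(xml_text)
print(ids["b1"].text)
```

Passing encoding="unicode" makes tostring() return a text string rather than encoded bytes.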

xmltodict

xmltodict is a Python module that allows you to process XML as if you were processing JSON.

For an XML file like this.

<mydocument has="an attribute">
  <and>
    <many>elements</many>
    <many>more elements</many>
  </and>
  <plus a="complex">
    element as well
  </plus>
</mydocument>

It can be loaded into a Python dictionary as follows:

import xmltodict

with open('path/to/file.xml') as fd:
    doc = xmltodict.parse(fd.read())

You can access elements, attributes and values.

doc['mydocument']['@has'] # == u'an attribute'
doc['mydocument']['and']['many'] # == [u'elements', u'more elements']
doc['mydocument']['plus']['@a'] # == u'complex'
doc['mydocument']['plus']['#text'] # == u'element as well'

xmltodict also has an unparse function that converts a dictionary back to XML. The library additionally offers a streaming mode suited to files too large to fit in memory, and it supports namespaces.

dicttoxml is a tool for converting dictionaries to xml, for those interested.

untangle

The untangle library maps an XML document to a Python object that contains the node and attribute information of the original document in its structure.

Example:

<?xml version="1.0"?>
<root>
    <child name="child1"/>
</root>

It can be loaded as follows.

import untangle

obj = untangle.parse('path/to/file.xml')

Then you can get the child element name like this.

obj.root.child['name']

untangle also supports loading XML from strings or URLs.

Other Tools

lxml: A library for parsing XML and HTML and extracting data with XPath.
BeautifulSoup: A Python library for parsing HTML and XML documents and extracting data from them.
Parsel: Scrapy's own HTML and XML parsing tool, which can use XPath, CSS selectors or regular expressions for data extraction.
xmldataset: A Python library that simplifies the extraction of datasets from XML content.
