Chapter 23 - The xml module

Python has built-in XML parsing capabilities that you can access via its xml module. In this chapter, we will focus on two of the xml module's sub-modules:

minidom

ElementTree

We'll start with minidom because it used to be the de facto method of XML parsing. Then we will look at how to use ElementTree.

Working with minidom

To start out, we need some actual XML to parse. Take a look at the following short example of XML:

<?xml version="1.0" ?>
<zAppointments reminder="15">
    <appointment>
        <begin>1181251680</begin>
        <uid>040000008200E000</uid>
        <alarmTime>1181572063</alarmTime>
        <state></state>
        <location></location>
        <duration>1800</duration>
        <subject>Bring pizza home</subject>
    </appointment>
</zAppointments>

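This is fairly typical XML and actually pretty intuitive. There is some really nasty XML out in the wild that you may have to work with. Anyway, save the XML code above with the following name: appt.xml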

Let's spend some time getting acquainted with how to use Python's minidom module. This is a fairly long piece of code, so prepare yourself.

import urllib.error
import urllib.request
import xml.dom.minidom

class ApptParser(object):

    def __init__(self, url, flag='url'):
        self.list = []
        self.appt_list = []
        self.flag = flag
        self.rem_value = 0
        root = self.getXml(url)
        self.handleXml(root)

    def getXml(self, url):
        # Treat the argument as a URL first; if urlopen rejects it,
        # assume it is a path to a local file instead
        try:
            print(url)
            f = urllib.request.urlopen(url)
        except (ValueError, urllib.error.URLError):
            f = url

        doc = xml.dom.minidom.parse(f)
        node = doc.documentElement
        if node.nodeType == xml.dom.Node.ELEMENT_NODE:
            print('Element name: %s' % node.nodeName)
            for (name, value) in node.attributes.items():
                print('    Attr -- Name: %s  Value: %s' % (name, value))

        return node

    def handleXml(self, xml):
        # xml is the root <zAppointments> element returned by getXml
        appointments = xml.getElementsByTagName("appointment")
        self.handleAppts(appointments)

    def getElement(self, element):
        return self.getText(element.childNodes)

    def handleAppts(self, appts):
        for appt in appts:
            self.handleAppt(appt)
            self.list = []

    def handleAppt(self, appt):
        begin     = self.getElement(appt.getElementsByTagName("begin")[0])
        duration  = self.getElement(appt.getElementsByTagName("duration")[0])
        subject   = self.getElement(appt.getElementsByTagName("subject")[0])
        location  = self.getElement(appt.getElementsByTagName("location")[0])
        uid       = self.getElement(appt.getElementsByTagName("uid")[0])

        self.list.append(begin)
        self.list.append(duration)
        self.list.append(subject)
        self.list.append(location)
        self.list.append(uid)
        if self.flag == 'file':

            try:
                state     = self.getElement(appt.getElementsByTagName("state")[0])
                self.list.append(state)
                alarm     = self.getElement(appt.getElementsByTagName("alarmTime")[0])
                self.list.append(alarm)
            except Exception as e:
                print(e)

        self.appt_list.append(self.list)

    def getText(self, nodelist):
        rc = ""
        for node in nodelist:
            if node.nodeType == node.TEXT_NODE:
                rc = rc + node.data
        return rc

if __name__ == "__main__":
    appt = ApptParser("appt.xml")
    print(appt.appt_list)

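This code is loosely based on an example from the Python documentation, and I have to admit that I think my mutation of it is a bit ugly. Let's break this code down a bit. The url parameter you see in the ApptParser class can be either a URL or a file. In the getXml method, we use an exception handler to try to open the URL. If it happens to raise an error, then we assume that the URL is actually a file path. Next we use minidom's parse method to parse the XML. Then we pull out a node from the XML. We'll ignore the conditional as it isn't important to this discussion. Finally, we return the node object.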

Technically, the node is XML, and we pass it on to the handleXml method. To grab all the appointment instances in the XML, we do this:

xml.getElementsByTagName("appointment")

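Then we pass that information to the handleAppts method. That's a lot of passing information around. It might be a good idea to refactor this code a bit so that instead of passing information around, it just sets class variables and then calls the next method without any arguments. I'll leave this as an exercise for the reader. Anyway, all the handleAppts method does is loop over each appointment and call the handleAppt method, which pulls some additional information out of the appointment, adds the data to a list, and adds that list to another list. The idea is to end up with a list of lists that holds all the pertinent data about my appointments.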

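You will notice that the handleAppt method calls the getElement method, which calls the getText method. Technically, you could skip the call to getElement and just call getText directly. On the other hand, you may need to add some additional processing to getElement to convert the text to some other type before returning it. For example, you may want to convert numbers to integers, floats or decimal.Decimal objects.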
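The kind of type conversion described above could be sketched as follows. This is only an illustration: the CONVERTERS mapping and the get_typed_element name are hypothetical and not part of the ApptParser class, and the XML is parsed from an inline string so the example runs on its own.

```python
from decimal import Decimal
import xml.dom.minidom

# Hypothetical mapping from tag name to converter (an assumption for this sketch)
CONVERTERS = {"begin": int, "duration": int, "price": Decimal}

def get_text(nodelist):
    """Join the data of every text node in nodelist."""
    return "".join(n.data for n in nodelist if n.nodeType == n.TEXT_NODE)

def get_typed_element(element):
    """Return the element's text, converted if a converter is registered."""
    text = get_text(element.childNodes)
    convert = CONVERTERS.get(element.tagName)
    # Leave empty strings alone so that int("") is never attempted
    return convert(text) if convert and text else text

doc = xml.dom.minidom.parseString(
    "<appointment><begin>1181251680</begin>"
    "<subject>Bring pizza home</subject>"
    "<location></location></appointment>"
)
appt = doc.documentElement
begin = get_typed_element(appt.getElementsByTagName("begin")[0])
subject = get_typed_element(appt.getElementsByTagName("subject")[0])
location = get_typed_element(appt.getElementsByTagName("location")[0])
print(begin, subject, repr(location))
```

Tags without a registered converter, and tags with no text at all, come back as plain strings, which keeps the behaviour of the original getElement for everything else.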

Let's try one more example with minidom before we move on. We will use an XML example from Microsoft's MSDN website: http://msdn.microsoft.com/en-us/library/ms762271%28VS.85%29.aspx. Save the following XML as example.xml

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.</description>
   </book>
   <book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology
      society in England, the young survivors lay the
      foundation for a new society.</description>
   </book>
</catalog>

For this example, we'll just parse the XML, extract the book titles and print them to stdout. Here's the code:

import xml.dom.minidom as minidom

def getTitles(xml):
    """
    Print out all titles found in xml
    """
    doc = minidom.parse(xml)
    node = doc.documentElement
    books = doc.getElementsByTagName("book")

    titles = []
    for book in books:
        titleObj = book.getElementsByTagName("title")[0]
        titles.append(titleObj)

    for title in titles:
        nodes = title.childNodes
        for node in nodes:
            if node.nodeType == node.TEXT_NODE:
                print(node.data)

if __name__ == "__main__":
    document = 'example.xml'
    getTitles(document)

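This code is just one short function that accepts one argument, the XML file. We import the minidom module and give it the same name to make it easier to reference. Then we parse the XML. The first two lines in the function are pretty much the same as in the previous example. We use the getElementsByTagName method to grab the parts of the XML that we want, then iterate over the result and extract the book titles from them. This actually extracts title objects, so we need to iterate over those as well and pull out the plain text, which is why we use a nested for loop.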
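As a variation on the walkthrough above, the sketch below returns the titles instead of printing them and keys them by each book's id attribute. The get_titles name and the trimmed inline sample are assumptions made so the snippet is self-contained; it is not the function from the listing.

```python
import xml.dom.minidom as minidom

SAMPLE = """<?xml version="1.0"?>
<catalog>
   <book id="bk101"><title>XML Developer's Guide</title></book>
   <book id="bk102"><title>Midnight Rain</title></book>
</catalog>"""

def get_titles(xml_text):
    """Return a dict mapping each book id to its title text."""
    doc = minidom.parseString(xml_text)
    titles = {}
    for book in doc.getElementsByTagName("book"):
        title = book.getElementsByTagName("title")[0]
        # Join every text node in case the title text is split into pieces
        text = "".join(n.data for n in title.childNodes
                       if n.nodeType == n.TEXT_NODE)
        titles[book.getAttribute("id")] = text
    return titles

titles = get_titles(SAMPLE)
print(titles)
```

Returning a dictionary rather than printing makes the result easier to reuse or test, at the cost of assuming that every book element carries a unique id.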

Now let's spend a little time trying out another sub-module of the xml module, named ElementTree.

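Parsing with ElementTree

In this section, you will learn how to create an XML file, edit XML and parse XML with ElementTree. For comparison's sake, we'll use the same XML we used in the previous section to illustrate the differences between using minidom and ElementTree. Here is the original XML: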

<?xml version="1.0" ?>
<zAppointments reminder="15">
    <appointment>
        <begin>1181251680</begin>
        <uid>040000008200E000</uid>
        <alarmTime>1181572063</alarmTime>
        <state></state>
        <location></location>
        <duration>1800</duration>
        <subject>Bring pizza home</subject>
    </appointment>
</zAppointments>

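Let's begin by learning how to create this piece of XML programmatically using Python!

How to Create XML with ElementTree

Creating XML with ElementTree is very simple. In this section, we will attempt to create the XML above with Python. Here's the code: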

import xml.etree.ElementTree as xml

def createXML(filename):
    """
    Create an example XML file
    """
    root = xml.Element("zAppointments")
    appt = xml.Element("appointment")
    root.append(appt)

    # add appointment children
    begin = xml.SubElement(appt, "begin")
    begin.text = "1181251680"

    uid = xml.SubElement(appt, "uid")
    uid.text = "040000008200E000"

    alarmTime = xml.SubElement(appt, "alarmTime")
    alarmTime.text = "1181572063"

    state = xml.SubElement(appt, "state")

    location = xml.SubElement(appt, "location")

    duration = xml.SubElement(appt, "duration")
    duration.text = "1800"

    subject = xml.SubElement(appt, "subject")

    tree = xml.ElementTree(root)
    # write() emits bytes by default, so open the file in binary mode
    with open(filename, "wb") as fh:
        tree.write(fh)

if __name__ == "__main__":
    createXML("appt.xml")

If you run this code, you should get something like the following (probably all on one line):

<zAppointments>
    <appointment>
        <begin>1181251680</begin>
        <uid>040000008200E000</uid>
        <alarmTime>1181572063</alarmTime>
        <state />
        <location />
        <duration>1800</duration>
        <subject />
    </appointment>
</zAppointments>

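This is pretty close to the original and is certainly valid XML. While it's not quite the same (the reminder attribute and the subject text are missing), it's close enough. Let's take a moment to review the code and make sure we understand it. First we create the root element using ElementTree's Element function. Then we create an appointment element and append it to the root. Next we create the children by passing the appointment Element object (appt) to SubElement along with a name, like "begin". Then, for each child that needs a value, we set its text property. At the end of the script, we create an ElementTree and use it to write the XML out to a file.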
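To close the remaining gap with the original file, the sketch below also sets the reminder attribute and the subject text. The build_appointments name and the list of (tag, text) pairs are choices made for this illustration, not part of the listing above.

```python
import xml.etree.ElementTree as ET

def build_appointments():
    """Build the appointment tree, including the attribute and subject text."""
    # Keyword arguments passed to Element become XML attributes
    root = ET.Element("zAppointments", reminder="15")
    appt = ET.SubElement(root, "appointment")
    # (tag, text) pairs in document order; None leaves the element empty
    fields = [
        ("begin", "1181251680"),
        ("uid", "040000008200E000"),
        ("alarmTime", "1181572063"),
        ("state", None),
        ("location", None),
        ("duration", "1800"),
        ("subject", "Bring pizza home"),
    ]
    for tag, text in fields:
        ET.SubElement(appt, tag).text = text
    return root

root = build_appointments()
# encoding="unicode" makes tostring() return a str instead of bytes
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The result is still written on a single line. On Python 3.9 and later, calling ET.indent(root) before serializing adds the line breaks and indentation.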

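Now we're ready to learn how to edit the file!

How to Edit XML with ElementTree

Editing XML with ElementTree is also easy. To make things a little more interesting though, we'll add another appointment block to the XML: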

<?xml version="1.0" ?>
<zAppointments reminder="15">
    <appointment>
        <begin>1181251680</begin>
        <uid>040000008200E000</uid>
        <alarmTime>1181572063</alarmTime>
        <state></state>
        <location></location>
        <duration>1800</duration>
        <subject>Bring pizza home</subject>
    </appointment>
    <appointment>
        <begin>1181253977</begin>
        <uid>sdlkjlkadhdakhdfd</uid>
        <alarmTime>1181588888</alarmTime>
        <state>TX</state>
        <location>Dallas</location>
        <duration>1800</duration>
        <subject>Bring pizza home</subject>
    </appointment>
</zAppointments>

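Save this XML as original_appt.xml, the file name the script below expects. Now let's write some code to change each of the begin tag's values from seconds since the epoch to something a little more readable. We'll use Python's time module to facilitate this: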

import time
import xml.etree.ElementTree as ET

def editXML(filename):
    """
    Edit an example XML file
    """
    tree = ET.ElementTree(file=filename)
    root = tree.getroot()

    for begin_time in root.iter("begin"):
        begin_time.text = time.ctime(int(begin_time.text))

    # write() emits bytes by default, so open the file in binary mode
    with open("updated.xml", "wb") as f:
        tree.write(f)

if __name__ == "__main__":
    editXML("original_appt.xml")

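Here we create an ElementTree object (tree) and extract the root from it. Then we use ElementTree's iter() method to find all the tags that are labeled "begin". Note that the iter() method was added in Python 2.7. In our for loop, we set each item's text property to a more human-readable time format via time.ctime(). You'll notice that we had to convert the string to an integer when passing it to ctime. The output should look something like the following (the exact times depend on your local time zone):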

<zAppointments reminder="15">
    <appointment>
        <begin>Thu Jun  7 16:28:00 2007</begin>
        <uid>040000008200E000</uid>
        <alarmTime>1181572063</alarmTime>
        <state />
        <location />
        <duration>1800</duration>
        <subject>Bring pizza home</subject>
    </appointment>
    <appointment>
        <begin>Thu Jun  7 17:06:17 2007</begin>
        <uid>sdlkjlkadhdakhdfd</uid>
        <alarmTime>1181588888</alarmTime>
        <state>TX</state>
        <location>Dallas</location>
        <duration>1800</duration>
        <subject>Bring pizza home</subject>
    </appointment>
</zAppointments>
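Because time.ctime() formats the timestamp in local time, the same script produces different text on machines in different time zones. If that matters, one alternative is to write a UTC timestamp in ISO 8601 format instead. The sketch below swaps in the datetime module for that purpose; the begin_to_iso name and the inline sample are assumptions made for the illustration.

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

SAMPLE = (
    "<zAppointments><appointment>"
    "<begin>1181251680</begin>"
    "</appointment></zAppointments>"
)

def begin_to_iso(root):
    """Rewrite every <begin> value as an ISO 8601 UTC timestamp, in place."""
    for begin in root.iter("begin"):
        stamp = datetime.fromtimestamp(int(begin.text), tz=timezone.utc)
        begin.text = stamp.isoformat()

root = ET.fromstring(SAMPLE)
begin_to_iso(root)
print(root.find("appointment/begin").text)  # 2007-06-07T21:28:00+00:00
```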

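You can also use ElementTree's find() or findall() methods to search for specific tags in your XML. The find() method will just find the first instance, whereas findall() will find all the tags with the specified label. These are helpful for editing purposes or for parsing, which is our next topic!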
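A minimal sketch of the difference between the two methods, using a trimmed copy of the two-appointment XML parsed from a string (the variable names are only illustrative):

```python
import xml.etree.ElementTree as ET

SAMPLE = """<zAppointments reminder="15">
    <appointment><uid>040000008200E000</uid><state></state></appointment>
    <appointment><uid>sdlkjlkadhdakhdfd</uid><state>TX</state></appointment>
</zAppointments>"""

root = ET.fromstring(SAMPLE)

# find() returns the first matching direct child, or None if nothing matches
first = root.find("appointment")
first_uid = first.find("uid").text

# findall() returns a list of every matching direct child
all_appts = root.findall("appointment")

# A path such as "appointment/uid" reaches one level further down
uids = [uid.text for uid in root.findall("appointment/uid")]

missing = root.find("no_such_tag")
print(first_uid, len(all_appts), uids, missing)
```

Both methods search only the direct children of the element they are called on unless you give them a path, so checking the result of find() for None before using it is a sensible habit.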

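How to Parse XML with ElementTree

Now we get to learn how to do some basic parsing with ElementTree. First we'll read through the code and then we'll go through it bit by bit so we can understand it. Note that this code is based around the original example, but it should work on the second one as well.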

import xml.etree.ElementTree as ET

def parseXML(xml_file):
    """
    Parse XML with ElementTree
    """
    tree = ET.ElementTree(file=xml_file)
    print(tree.getroot())
    root = tree.getroot()
    print("tag=%s, attrib=%s" % (root.tag, root.attrib))

    for child in root:
        print(child.tag, child.attrib)
        if child.tag == "appointment":
            for step_child in child:
                print(step_child.tag)

    # iterate over the entire tree
    print("-" * 40)
    print("Iterating using a tree iterator")
    print("-" * 40)
    # tree.iter() replaces getiterator(), which was removed in Python 3.9
    for elem in tree.iter():
        print(elem.tag)

    # get the information via the children!
    print("-" * 40)
    print("Iterating over the children")
    print("-" * 40)
    # list(element) replaces getchildren(), which was removed in Python 3.9
    appointments = list(root)
    for appointment in appointments:
        appt_children = list(appointment)
        for appt_child in appt_children:
            print("%s=%s" % (appt_child.tag, appt_child.text))

if __name__ == "__main__":
    parseXML("appt.xml")

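A note on imports: code written for older versions of Python often imports cElementTree instead of the normal ElementTree. The main difference between the two was that cElementTree was C-based instead of Python-based, so it was much faster. cElementTree was deprecated in Python 3.3 and removed in Python 3.9; xml.etree.ElementTree now uses the C implementation automatically whenever it is available, so that is the module to import. Anyway, once again we create an ElementTree object and extract the root from it. You'll note that we print out the root along with the root's tag and attributes. Next we show several ways of iterating over the tags. The first loop just iterates over the XML child by child. This would only print out the top-level child (appointment) though, so we added an if statement to check for that child and iterate over its children too.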

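Next we grab an iterator from the tree object itself and iterate over it that way. You get the same information, but without the extra steps of the first example. The third method walks the children of the root. Older code did this with getchildren(), which was removed in Python 3.9; iterating over an element directly, or calling list() on it, does the same job. Here again, we need an inner loop to grab all the children inside each appointment tag. Finally, root's iter() method can loop over just the tags that match a given string such as "begin", as the editing example in the previous section did.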

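As mentioned in the previous section, you could also use find() or findall() to help you find a specific tag or set of tags, respectively. Also note that each Element object has a tag and a text property that you can use to acquire that exact information.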
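Putting the tag and text properties to work, the sketch below collects each appointment into a dictionary. The appointments_as_dicts name is hypothetical and the XML is a shortened inline copy of the example, so treat it as one possible approach rather than the chapter's own code.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<zAppointments reminder="15">
    <appointment>
        <begin>1181251680</begin>
        <state></state>
        <subject>Bring pizza home</subject>
    </appointment>
</zAppointments>"""

def appointments_as_dicts(root):
    """Return one {tag: text} dict per <appointment> element."""
    result = []
    for appt in root.findall("appointment"):
        # An empty element has text None; normalise it to an empty string
        result.append({child.tag: child.text or "" for child in appt})
    return result

root = ET.fromstring(SAMPLE)
reminder = root.attrib["reminder"]
records = appointments_as_dicts(root)
print(reminder, records)
```

Compared with the list of lists built in the minidom example, a dictionary per appointment keeps each value attached to its tag name, so the order of the fields no longer matters.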

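Wrapping Up

Now you know how to use minidom to parse XML. You have also learned how to use ElementTree to create, edit and parse XML. There are other libraries outside of Python's standard library that provide additional methods for working with XML too. Be sure to do some research to make sure you're using a tool that you understand, as this topic can get pretty confusing if the tool you're using is obtuse.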