python - XML Parsing with Python and minidom

Question

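I'm using Python (minidom) to parse an XML file and print a hierarchical structure that looks something like this (indentation is used here to show the significant hierarchical relationships):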

My Document
Overview
    Basic Features
    About This Software
        Platforms Supported

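Instead, the program iterates over the nodes multiple times and prints duplicate nodes, producing the output below. (Looking at the node list at each iteration, it's obvious why it does this, but I can't seem to find a way to get the node list I'm looking for.)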

My Document
Overview
Basic Features
About This Software
Platforms Supported
Basic Features
About This Software
Platforms Supported
Platforms Supported

Here is the XML source file:

<?xml version="1.0" encoding="UTF-8"?>
<DOCMAP>
    <Topic Target="ALL">
        <Title>My Document</Title>
    </Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL">
            <Title>Basic Features</Title>
        </Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL">
                <Title>Platforms Supported</Title>
            </Topic>
        </Topic>
    </Topic>
</DOCMAP>

Here is the Python program:

import xml.dom.minidom
from xml.dom.minidom import Node

dom = xml.dom.minidom.parse("test.xml")
Topic=dom.getElementsByTagName('Topic')
i = 0
for node in Topic:
    alist=node.getElementsByTagName('Title')
    for a in alist:
        Title= a.firstChild.data
        print(Title)

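I was able to work around the problem by not nesting "Topic" elements, renaming the lower-level topics to something like "SubTopic1" and "SubTopic2". But I want to take advantage of the built-in XML hierarchy without needing different element names. It seems I should be able to nest "Topic" elements, and there should be some way to know which level of "Topic" I'm currently looking at.

I've tried a number of different XPath functions without much success.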

score 10 · Accepted Answer

getElementsByTagName is recursive: it returns all descendants with a matching tagName. Because your Topics contain other Topics that also have Titles, the call picks up the lower-level Titles many times.
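To make the recursion visible, here is a minimal sketch that counts how many Title descendants each Topic reports. The `SAMPLE` string is a trimmed copy of the question's document and `count_titles` is a hypothetical helper name, not part of minidom.

```python
import xml.dom.minidom

# Trimmed copy of the question's document (assumed here for illustration).
SAMPLE = """<DOCMAP>
    <Topic><Title>Overview</Title>
        <Topic><Title>Basic Features</Title></Topic>
        <Topic><Title>About This Software</Title>
            <Topic><Title>Platforms Supported</Title></Topic>
        </Topic>
    </Topic>
</DOCMAP>"""

def count_titles(xml_text):
    """Return (own title, number of Title descendants) for every Topic."""
    dom = xml.dom.minidom.parseString(xml_text)
    result = []
    for topic in dom.getElementsByTagName("Topic"):
        # Matches Title elements at any depth below this Topic.
        titles = topic.getElementsByTagName("Title")
        result.append((titles[0].firstChild.data, len(titles)))
    return result

for name, count in count_titles(SAMPLE):
    print(name, count)
```

A Topic with nested Topics reports more than one Title, which is where the duplicate lines in the question's output come from.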

If you want only the matching direct children and you don't have XPath available, you can write a simple filter, e.g.:

def getChildrenByTagName(node, tagName):
    for child in node.childNodes:
        if child.nodeType==child.ELEMENT_NODE and (tagName=='*' or child.tagName==tagName):
            yield child

# dom is the parsed document from the question
for topic in dom.getElementsByTagName('Topic'):
    title = list(getChildrenByTagName(topic, 'Title'))[0]    # or just next(getChildrenByTagName(...))
    print(title.firstChild.data)
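The same direct-child filter can also drive a recursive walk that yields the indented outline the question asks for. This is only a sketch under a few assumptions: the names `children_by_tag` and `outline` are made up for illustration, the XML is inlined as `SAMPLE` instead of being read from test.xml, and each Topic is assumed to hold its Title as a direct child.

```python
import xml.dom.minidom

# Inline copy of the question's XML (assumed here instead of test.xml).
SAMPLE = """<DOCMAP>
    <Topic Target="ALL">
        <Title>My Document</Title>
    </Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL">
            <Title>Basic Features</Title>
        </Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL">
                <Title>Platforms Supported</Title>
            </Topic>
        </Topic>
    </Topic>
</DOCMAP>"""

def children_by_tag(node, tag_name):
    """Yield only the direct element children of node named tag_name."""
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE and child.tagName == tag_name:
            yield child

def outline(node, depth=0, indent="    "):
    """Return one indented line per Topic below node, in document order."""
    lines = []
    for topic in children_by_tag(node, "Topic"):
        title = next(children_by_tag(topic, "Title"), None)
        if title is not None and title.firstChild is not None:
            lines.append(indent * depth + title.firstChild.data)
        # Nested Topics are one level deeper than their parent.
        lines.extend(outline(topic, depth + 1, indent))
    return lines

dom = xml.dom.minidom.parseString(SAMPLE)
print("\n".join(outline(dom.documentElement)))
```

Because each call only looks at direct children, every Title is visited exactly once and the recursion depth doubles as the indentation level.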

score 8 · Accepted Answer

The following works:

import xml.dom.minidom
from xml.dom.minidom import Node

dom = xml.dom.minidom.parse("docmap.xml")

def getChildrenByTitle(node):
    for child in node.childNodes:
        if child.localName=='Title':
            yield child

Topic=dom.getElementsByTagName('Topic')
for node in Topic:
    alist=getChildrenByTitle(node)
    for a in alist:
        Title= a.childNodes[0].nodeValue
        print(Title)
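The loop above prints each title once but without indentation. One possible way to recover the level, sketched here with hypothetical helpers `topic_depth` and `titles_with_depth` and an inline `SAMPLE` standing in for docmap.xml, is to count the enclosing Topic elements by following `parentNode`:

```python
import xml.dom.minidom

# Inline stand-in for docmap.xml (assumed for the sake of a runnable example).
SAMPLE = """<DOCMAP>
    <Topic><Title>My Document</Title></Topic>
    <Topic><Title>Overview</Title>
        <Topic><Title>Basic Features</Title></Topic>
        <Topic><Title>About This Software</Title>
            <Topic><Title>Platforms Supported</Title></Topic>
        </Topic>
    </Topic>
</DOCMAP>"""

def topic_depth(topic):
    """Count the Topic elements that enclose this one (0 for top level)."""
    depth = 0
    parent = topic.parentNode
    # Stop at the Document node, which is not an element.
    while parent is not None and parent.nodeType == parent.ELEMENT_NODE:
        if parent.tagName == "Topic":
            depth += 1
        parent = parent.parentNode
    return depth

def titles_with_depth(dom):
    """Pair each Topic's own (direct-child) Title text with its depth."""
    pairs = []
    for topic in dom.getElementsByTagName("Topic"):
        for child in topic.childNodes:
            if child.nodeType == child.ELEMENT_NODE and child.tagName == "Title":
                pairs.append((child.firstChild.data, topic_depth(topic)))
                break
    return pairs

dom = xml.dom.minidom.parseString(SAMPLE)
for title, depth in titles_with_depth(dom):
    print("    " * depth + title)
```

This keeps the flat getElementsByTagName loop and trades the recursion for a short walk up the tree per Topic.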

score 3 · Accepted Answer

You could use the following generator to walk the tree and get the titles together with their indentation levels:

def f(elem, level=-1):
    if elem.nodeName == "Title":
        yield elem.childNodes[0].nodeValue, level
    elif elem.nodeType == elem.ELEMENT_NODE:
        for child in elem.childNodes:
            for e, l in f(child, level + 1):
                yield e, l

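If you test it with your file:

import xml.dom.minidom as minidom
doc = minidom.parse("test.xml")
# start at the root element: the Document node itself is not an ELEMENT_NODE
list(f(doc.documentElement))

you will get a list with the following tuples: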

(u'My Document', 1), 
(u'Overview', 1), 
(u'Basic Features', 2), 
(u'About This Software', 2), 
(u'Platforms Supported', 3)

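Of course, this is only a basic idea to be fine-tuned. If you just want leading spaces, you can code that directly in the generator, though keeping the level gives you more flexibility. You could also detect the first level automatically (initializing the level to -1 here is just a crude workaround...).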
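As a rough sketch of that fine-tuning, the (title, level) pairs can be turned into indented lines while detecting the first level automatically. The helper name `indent_titles` and the four-space default are assumptions, and the `pairs` list simply repeats the tuples shown above.

```python
def indent_titles(pairs, indent="    "):
    """Turn (title, level) pairs into indented lines.

    The shallowest level found is treated as the first level and gets
    no indentation, so the starting value of level no longer matters.
    """
    pairs = list(pairs)
    if not pairs:
        return []
    base = min(level for _, level in pairs)
    return [indent * (level - base) + title for title, level in pairs]

# The tuples produced by the generator for the question's file.
pairs = [
    ("My Document", 1),
    ("Overview", 1),
    ("Basic Features", 2),
    ("About This Software", 2),
    ("Platforms Supported", 3),
]
print("\n".join(indent_titles(pairs)))
```

Keeping the formatting in a separate step leaves the generator reusable for other outputs, such as building a nested list.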

score 2 · Accepted Answer

Recursive function:

import xml.dom.minidom

def traverseTree(document, depth=0):
  # Print the text of every Title element, indented by its nesting depth.
  for child in document.childNodes:
    if child.nodeType == child.TEXT_NODE:
      if document.tagName == 'Title':
        print(depth*'    ', child.data)
    if child.nodeType == xml.dom.Node.ELEMENT_NODE:
      traverseTree(child, depth+1)

filename = 'sample.xml'
dom = xml.dom.minidom.parse(filename)
traverseTree(dom.documentElement)

Your XML:

<?xml version="1.0" encoding="UTF-8"?>
<DOCMAP>
    <Topic Target="ALL">
        <Title>My Document</Title>
    </Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL">
            <Title>Basic Features</Title>
        </Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL">
                <Title>Platforms Supported</Title>
            </Topic>
        </Topic>
    </Topic>
</DOCMAP>

Desired output:

 $ python parse_sample.py 
      My Document
      Overview
          Basic Features
          About This Software
              Platforms Supported
