python - lxml XPath in Python: how to handle missing tags?

Question

Suppose I want to parse the following XML with lxml XPath expressions:

<pack xmlns="http://ns.qubic.tv/2010/item">
    <packitem>
        <duration>520</duration>
        <max_count>14</max_count>
    </packitem>
    <packitem>
        <duration>12</duration>
    </packitem>
</pack>

This is a variation of what can be found at http://python-thoughts.blogspot.fr/2012/01/default-value-for-text-function-using.html

How can I parse the different elements so that, once zipped (in the sense of the Python zip or izip functions), they give

[(520,14),(12,None)]

?

The max_count tag is missing from the second packitem, which prevents me from getting the result I want.

score 3 · Accepted Answer

from lxml import etree

def lxml_empty_str(context, nodes):
    # Replace a None text value with an empty string on each selected node
    for node in nodes:
        node.text = node.text or ""
    return nodes

# Register the function so it can be called from XPath expressions
ns = etree.FunctionNamespace('http://ns.qubic.tv/lxmlfunctions')
ns['lxml_empty_str'] = lxml_empty_str

namespaces = {'i': "http://ns.qubic.tv/2010/item",
              'f': "http://ns.qubic.tv/lxmlfunctions"}
# root is assumed to be the already parsed <pack> document
packitems_duration = root.xpath(
    'f:lxml_empty_str(//i:pack/i:packitem/i:duration)/text()',
    namespaces=namespaces)
packitems_max_count = root.xpath(
    'f:lxml_empty_str(//i:pack/i:packitem/i:max_count)/text()',
    namespaces=namespaces)
packitems = zip(packitems_duration, packitems_max_count)

>>> packitems
[('520','14'), ('','23')]

http://python-thoughts.blogspot.fr/2012/01/default-value-for-text-function-using.html

score 1 · Accepted Answer

You can use xpath to find the packitems, then call xpath again (or findtext, as done below) on each one to find its duration and max_count. Having to call xpath more than once might not be terribly fast, but it works.

import lxml.etree as ET

content = '''<pack xmlns="http://ns.qubic.tv/2010/item">
    <packitem>
        <duration>520</duration>
        <max_count>14</max_count>
    </packitem>
    <packitem>
        <duration>12</duration>
    </packitem>
</pack>
'''

def make_int(text):
    try:
        return int(text)
    except TypeError:
        return None

namespaces = {'ns' : 'http://ns.qubic.tv/2010/item'}
doc = ET.fromstring(content)
result = [tuple([make_int(elt.findtext(path, namespaces = namespaces))
                           for path in ('ns:duration', 'ns:max_count')])
          for elt in doc.xpath('//ns:packitem', namespaces = namespaces) ]
print(result)
# [(520, 14), (12, None)]
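
Since lxml follows the ElementTree API, the same per-item lookup can also be sketched with the standard library alone. This is only an illustration under that assumption, not part of the original answer: the names NAMESPACES, SAMPLE, to_int and parse_pack are made up for the example, and it uses xml.etree.ElementTree in place of lxml. The key point is unchanged: findtext is called once per packitem, so a missing max_count yields None for that item instead of shifting the values of later items.

```python
import xml.etree.ElementTree as ET

# Prefix chosen for this sketch; the URI is the one from the question.
NAMESPACES = {'ns': 'http://ns.qubic.tv/2010/item'}

SAMPLE = '''<pack xmlns="http://ns.qubic.tv/2010/item">
    <packitem>
        <duration>520</duration>
        <max_count>14</max_count>
    </packitem>
    <packitem>
        <duration>12</duration>
    </packitem>
</pack>
'''

def to_int(text):
    # findtext returns None when the child element is absent;
    # an element that is present but blank is treated the same way here.
    if text is None or not text.strip():
        return None
    return int(text)

def parse_pack(xml_text):
    root = ET.fromstring(xml_text)
    # Look up both fields relative to each packitem, one item at a time.
    return [
        (to_int(item.findtext('ns:duration', namespaces=NAMESPACES)),
         to_int(item.findtext('ns:max_count', namespaces=NAMESPACES)))
        for item in root.findall('ns:packitem', NAMESPACES)
    ]

print(parse_pack(SAMPLE))
# [(520, 14), (12, None)]
```

Treating a blank element like a missing one is a design choice of this sketch; return a different default from to_int if the two cases need to be told apart.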

Another approach is to use a SAX parser. It might be a bit faster, but it requires a bit more code, and if the XML is not huge the speed difference may not matter.
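
A rough sketch of what the SAX route could look like, assuming the standard library's xml.sax module is used. The class PackItemHandler, the helper parse_packitems and the constants are hypothetical names introduced for illustration. The handler collects the text of duration and max_count while inside a packitem and emits one tuple when the packitem closes, so a field that never appears stays None.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

ITEM_NS = 'http://ns.qubic.tv/2010/item'
FIELDS = ('duration', 'max_count')

SAMPLE = '''<pack xmlns="http://ns.qubic.tv/2010/item">
    <packitem>
        <duration>520</duration>
        <max_count>14</max_count>
    </packitem>
    <packitem>
        <duration>12</duration>
    </packitem>
</pack>
'''

class PackItemHandler(ContentHandler):
    def __init__(self):
        super().__init__()
        self.result = []
        self._current = None  # values of the packitem being read
        self._field = None    # name of the field being read, if any
        self._chunks = []     # text pieces of the current field

    def startElementNS(self, name, qname, attrs):
        uri, local = name
        if uri != ITEM_NS:
            return
        if local == 'packitem':
            # Every field starts as None, so absent tags stay None.
            self._current = dict.fromkeys(FIELDS)
        elif self._current is not None and local in FIELDS:
            self._field = local
            self._chunks = []

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it.
        if self._field is not None:
            self._chunks.append(content)

    def endElementNS(self, name, qname):
        uri, local = name
        if uri != ITEM_NS:
            return
        if self._field is not None and local == self._field:
            text = ''.join(self._chunks).strip()
            self._current[local] = int(text) if text else None
            self._field = None
        elif local == 'packitem':
            self.result.append(tuple(self._current[f] for f in FIELDS))
            self._current = None

def parse_packitems(xml_text):
    handler = PackItemHandler()
    parser = xml.sax.make_parser()
    # Namespace mode makes the parser call the *NS callbacks
    # with (uri, localname) tuples.
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_text.encode('utf-8')))
    return handler.result

print(parse_packitems(SAMPLE))
# [(520, 14), (12, None)]
```

Because the handler keeps only the current packitem in memory, this style suits very large documents; for small inputs the per-item findtext version is shorter and easier to read.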
