python - Python's lxml iterparse can't handle namespaces

Question

from lxml import etree
import StringIO

data= StringIO.StringIO('<root xmlns="http://some.random.schema"><a>One</a><a>Two</a><a>Three</a></root>')
docs = etree.iterparse(data,tag='a')
a,b = docs.next()


Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "iterparse.pxi", line 478, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:95348)
  File "iterparse.pxi", line 534, in lxml.etree.iterparse._read_more_events (src/lxml/lxml.etree.c:95938)
StopIteration

It works fine until I add a namespace to the root node. Any ideas on what I can do as a workaround, or on the correct way to do this? The file is very large, so the parsing needs to be event-driven.

score 11 · Accepted Answer

When a namespace is attached, the tag is not a but {http://some.random.schema}a. Try this (Python 3):

from lxml import etree
from io import BytesIO

xml = '''\
<root xmlns="http://some.random.schema">
  <a>One</a>
  <a>Two</a>
  <a>Three</a>
</root>'''
data = BytesIO(xml.encode())
docs = etree.iterparse(data, tag='{http://some.random.schema}a')
for event, elem in docs:
    print(f'{event}: {elem}')

Or, for Python 2:

from lxml import etree
from StringIO import StringIO

xml = '''\
<root xmlns="http://some.random.schema">
  <a>One</a>
  <a>Two</a>
  <a>Three</a>
</root>'''
data = StringIO(xml)
docs = etree.iterparse(data, tag='{http://some.random.schema}a')
for event, elem in docs:
    print event, elem

This prints something like:

end: <Element {http://some.random.schema}a at 0x10941e730>
end: <Element {http://some.random.schema}a at 0x10941e8c0>
end: <Element {http://some.random.schema}a at 0x10941e960>
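Since the question mentions a very large file, here is a minimal sketch of how the namespaced tag filter might be combined with freeing elements as they are processed. The helper name iter_a_texts and the NS constant are made up for illustration; the clear-and-delete pattern is a commonly used idiom with lxml, not something stated in the answer above.

```python
from io import BytesIO
from lxml import etree

NS = "http://some.random.schema"  # namespace URI from the question's sample


def iter_a_texts(source):
    """Yield the text of each namespaced <a> element, freeing memory as we go.

    Hypothetical helper for illustration only.
    """
    for _event, elem in etree.iterparse(source, events=("end",), tag=f"{{{NS}}}a"):
        yield elem.text
        # Drop the element's own content once it has been used.
        elem.clear()
        # Remove already-processed siblings that the parent still references.
        while elem.getprevious() is not None:
            del elem.getparent()[0]


xml = b'<root xmlns="http://some.random.schema"><a>One</a><a>Two</a><a>Three</a></root>'
texts = list(iter_a_texts(BytesIO(xml)))
print(texts)  # ['One', 'Two', 'Three']
```

For a real file, a filename or an open binary file object could be passed in place of the BytesIO wrapper.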

As @mihail-shcheglov pointed out, you can also use the wildcard {*}, which matches the tag in any namespace or in no namespace at all.

from lxml import etree
from io import BytesIO

xml = '''\
<root xmlns="http://some.random.schema">
  <a>One</a>
  <a>Two</a>
  <a>Three</a>
</root>'''
data = BytesIO(xml.encode())
docs = etree.iterparse(data, tag='{*}a')
for event, elem in docs:
    print(f'{event}: {elem}')

See the lxml.etree documentation for details.
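To illustrate the wildcard behaviour, the following sketch runs the same '{*}a' filter against a namespaced and a non-namespaced document and uses etree.QName to read the local tag name without the {namespace} prefix. The helper name local_names and both sample documents are assumptions made for this example.

```python
from io import BytesIO
from lxml import etree


def local_names(xml_bytes):
    """Return the namespace-free names of the elements matched by '{*}a'.

    Hypothetical helper for illustration only.
    """
    events = etree.iterparse(BytesIO(xml_bytes), tag="{*}a")
    # QName splits '{uri}name' into its namespace and local parts.
    return [etree.QName(elem).localname for _event, elem in events]


with_ns = b'<root xmlns="http://some.random.schema"><a>One</a><b>Two</b></root>'
without_ns = b'<root><a>One</a><b>Two</b></root>'

print(local_names(with_ns))     # ['a']
print(local_names(without_ns))  # ['a']
```

In both cases only the a element is reported; the b element is filtered out by the tag argument.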

score -3 · Accepted Answer

Why not use a regular expression?

1)

Using lxml is slower than using a regular expression.

from time import clock
import StringIO

from lxml import etree

# Time 1000 calls to etree.iterparse() and keep the fastest one.
# Only the iterparse() call sits between the two clock() readings.
times1 = []
for i in xrange(1000):
    data = StringIO.StringIO('<root ><a>One</a><a>Two</a><a>Three\nlittle pigs</a><b>Four</b><a>another</a></root>')
    te = clock()
    docs = etree.iterparse(data, tag='a')
    tf = clock()
    times1.append(tf - te)
print min(times1)

print [etree.tostring(y) for x, y in docs]

import re

regx = re.compile(r'<a>[\s\S]*?</a>')

# Time 1000 regex searches over the same document and keep the fastest one.
times2 = []
for i in xrange(1000):
    data = StringIO.StringIO('<root ><a>One</a><a>Two</a><a>Three\nlittle pigs</a><b>Four</b><a>another</a></root>')
    te = clock()
    li = regx.findall(data.read())
    tf = clock()
    times2.append(tf - te)
print min(times2)

print li

Result

0.000150298431784
['<a>One</a>', '<a>Two</a>', '<a>Three\nlittle pigs</a>', '<a>another</a>']
2.40253998762e-05
['<a>One</a>', '<a>Two</a>', '<a>Three\nlittle pigs</a>', '<a>another</a>']

0.000150298431784 / 2.40253998762e-05 is about 6.26
By this measurement, lxml is about 6.26 times slower than the regex


2)

The namespace causes no problem here:

import StringIO
import re

regx = re.compile('<a>[\s\S]*?</a>')

data= StringIO.StringIO('<root xmlns="http://some.random.schema"><a>One</a><a>Two</a><a>Three\nlittle pigs</a><b>Four</b><a>another</a></root>')
print regx.findall(data.read())

Result

['<a>One</a>', '<a>Two</a>', '<a>Three\nlittle pigs</a>', '<a>another</a>']
