python - Traversing TEI in Python 3 returns empty text for some entities

Question

I have a TEI-encoded XML file that contains entities like the following:

<sp>
    <speaker rend="italic">Sampson.</speaker>
    <ab>
         <lb n="5"/>
         <hi rend="italic">Gregory:</hi>
         <seg type="homograph">A</seg> my word wee'l not carry coales.<lb n="6"/>
    </ab>
</sp>
<sp>
     <speaker rend="italic">Greg.</speaker>
     <ab>No, for then we should be Colliars.
         <lb n="7" rend="rj"/>
     </ab>
</sp>

The complete file is very large, but it can be accessed at http://ota.ox.ac.uk/desc/5721. I am using Python 3 to traverse the XML and trying to get all the text associated with the tags where dialogue is found.

import xml.etree.ElementTree as etree
tree = etree.parse('romeo_juliet_5721.xml')
doc = tree.getroot()
for i in doc.iter(tag='{http://www.tei-c.org/ns/1.0}ab'):   
        print(i.tag, i.text)
>>> {http://www.tei-c.org/ns/1.0}ab 
>>>                  
>>> {http://www.tei-c.org/ns/1.0}ab No, for then we should be Colliars.

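The output catches the entities fine, but it does not recognize "my word wee'l not carry coales." as the text of the first ab. If that text belongs to another element, I am not seeing which one. I have thought about converting the whole element to a string and using a regex (or stripping all the XML tags) to get the element text, but I would like to understand what is going on here. Thanks for any help.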

score 3 · Accepted Answer

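This is because, in the ElementTree model, the text "my word wee'l not carry coales." is considered the tail of the <seg> element rather than the text of <ab>. An element's text is only what appears before its first child; anything that follows a child's closing tag is stored as that child's tail. To get the text of the element together with the tails of its children, you can try the following: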

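for i in doc.iter(tag='{http://www.tei-c.org/ns/1.0}ab'):
    # Skip i itself: its own tail is the text that follows </ab>, outside the element
    tails = ''.join((el.tail or '') for el in i.iter() if el is not i)
    # i.text is None when <ab> starts with a child element, so fall back to ''
    innerText = ((i.text or '') + tails).strip()
    print(i.tag, innerText)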
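Note that the loop above collects the ab text and the child tails, but not the text inside the children themselves ("Gregory:" and "A"). If you want every piece of text inside the element, the standard library's Element.itertext() walks text and tails in document order. The following is only a minimal sketch: SAMPLE is a cut-down stand-in built from the fragment in the question (not the real file), and inner_text is a hypothetical helper name, not part of ElementTree.

```python
import xml.etree.ElementTree as ET

NS = '{http://www.tei-c.org/ns/1.0}'

# Assumption: a small stand-in for the TEI file, based on the question's fragment.
SAMPLE = (
    '<sp xmlns="http://www.tei-c.org/ns/1.0">'
    '<speaker rend="italic">Sampson.</speaker>'
    '<ab><lb n="5"/><hi rend="italic">Gregory:</hi> '
    '<seg type="homograph">A</seg> my word wee\'l not carry coales.<lb n="6"/></ab>'
    '</sp>'
)

root = ET.fromstring(SAMPLE)
ab = root.find(NS + 'ab')

# Show where each piece of text lives: .text sits before an element's first
# child, .tail sits after the element's own closing tag.
for el in ab.iter():
    print(el.tag.replace(NS, ''), repr(el.text), repr(el.tail))


def inner_text(element):
    """Hypothetical helper: all text inside element, whitespace collapsed."""
    return ' '.join(''.join(element.itertext()).split())


print(inner_text(ab))
```

itertext() does not include the tail of the element it is called on, so text that follows the closing ab tag is left out. In this sketch whitespace is collapsed because TEI files are usually indented, which would otherwise leave newlines and runs of spaces in the result.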
