python - Pythonを使用して多くのXMLファイルのタグを無視する方法

Question

たくさんのテキストを含む xml ファイルがたくさんあります。このテキストは、小文字にして句読点を削除する必要があります。しかし、Pythonを使用してすべてのタグを無視するように言う方法がわかりません。

ElementTree という xml パーサーを見つけました。タグを見つけるための正規表現があります。 pattern = re.compile ('<[^<]*?>')

テストしたところ、最初のタグのテキストのみが表示されます（という名前のタグがたくさんあります）。なんで？

すべてのタグを取得するために、文字列でテストして別のテストを行います。

text = "<root> <test>aaaaaaa </test> <test2> bbbbbbbbb </test2> </root> <root> <test3> cccccc </test3> <test4> ddddd </test4> </root>"
pattern = re.compile ('<[^<]*?>')
tmp = pattern.findall(content, re.DOTALL)

そしてそれは私に与えます：

['</test>', '<test2>', '</test2>', '</root>', '<root>', '<test3>', '</test3>', '<test4>', '</test4>', '</root>']

なぜそうしない<root> <test>のですか？

score 7 · Accepted Answer

実際にElementTreeを使用しているようには見えません。

ElementTree の使用例を次に示します。

import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()

再帰を使用して、関数を介してすべてのタグを実行し、それらをクリーンアップできます。

def clean_tag(tag):
    for child in tag:
        clean_tag(child)
    if tag.text != None:
        # add your code to do lowercase and punctuation here
        tag.text = tag.text.lower()

clean_tag(tree.getroot())
clean_xml = ET.tostring(tree)

python - Pythonを使用して多くのXMLファイルのタグを無視する方法

1 に答える 1

Related

Reference