python - Extracting text from XML documents in Python

Question

This is a sample XML document:

<bookstore>
    <book category="COOKING">
        <title lang="english">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
        <year>2005</year>
        <price>300.00</price>
    </book>

    <book category="CHILDREN">
        <title lang="english">Harry Potter</title>
        <author>J K. Rowling </author>
        <year>2005</year>
        <price>625.00</price>
    </book>
</bookstore>

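I have 10 documents like this and I want to extract the text without specifying element names. How can I do this? The user enters a word that I don't know in advance, so I have to search for it in the text parts of all 10 XML documents. To do that, I need to find where the text is without knowing anything about the elements. One more thing: all of these documents have different structures.

Please help!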

score 2 · Accepted Answer

You can use the lxml library with an XPath query.

xml="""<bookstore>
    <book category="COOKING">
        <title lang="english">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
        <year>2005</year>
        <price>300.00</price>
    </book>

    <book category="CHILDREN">
        <title lang="english">Harry Potter</title>
        <author>J K. Rowling </author>
        <year>2005</year>
        <price>625.00</price>
    </book>
</bookstore>
"""
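from lxml import etree

# etree.fromstring() already returns the root element, so no getroot() call is needed.
root = etree.fromstring(xml)
# Select the text node of every child element of every <book>.
root.xpath('/bookstore/book/*/text()')
# ['Everyday Italian', 'Giada De Laurentiis', '2005', '300.00', 'Harry Potter', 'J K. Rowling ', '2005', '625.00']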

This does not retrieve the category attribute, though.
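
To pick up attribute values such as the category as well, without naming any element, one option is to walk the whole tree. The following is a minimal sketch that swaps in the standard library's xml.etree.ElementTree instead of lxml; the function name extract_strings and the shortened sample are illustrative assumptions, not part of the answer above.

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample document (assumption: same shape as the question's XML).
XML = """<bookstore>
    <book category="COOKING">
        <title lang="english">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
    </book>
</bookstore>"""


def extract_strings(xml_text):
    """Return every attribute value and non-blank text chunk, in traversal order."""
    root = ET.fromstring(xml_text)
    found = []
    for elem in root.iter():  # visits every element, whatever its tag is
        found.extend(elem.attrib.values())
        # elem.text is the text before the first child; elem.tail is the text after the element.
        for chunk in (elem.text, elem.tail):
            if chunk and chunk.strip():
                found.append(chunk.strip())
    return found


print(extract_strings(XML))
# ['COOKING', 'english', 'Everyday Italian', 'Giada De Laurentiis']
```

Because the loop never refers to a tag name, the same function should work on documents with different structures.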

score 0 · Accepted Answer

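If you want to call grep from within Python, see the discussion here, and this post in particular.

If you want to search every file in a directory, you can use the glob module and try something like this: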

import glob
import re

# Capture whatever sits between a '>' and the next '<' on the same line.
p = re.compile(r'>(.*?)<')
for path in glob.glob("*.xml"):
    with open(path, "r") as f:
        content = f.read()
    texts = p.findall(content)
    print(texts)
    print()

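This loop iterates over every XML file in the current directory, opens each one, and extracts the text that matches the regular expression.

Output:

['Everyday Italian', 'Giada De Laurentiis', '2005', '300.00', 'Harry Potter', 'J K. Rowling ', '2005', '625.00']

Edit: updated the code to extract only the text elements from the XML.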
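
Since the question is about finding a user-supplied word across several differently structured files, the directory loop can also report where each match was found. This is a rough sketch that uses an XML parser (xml.etree.ElementTree) rather than the regular expression above; the function name find_word, the case-insensitive matching, and the choice to skip malformed files are all assumptions made for illustration.

```python
import glob
import os
import xml.etree.ElementTree as ET


def find_word(word, directory="."):
    """Return (filename, tag, text) for every element whose own text contains word."""
    hits = []
    for path in sorted(glob.glob(os.path.join(directory, "*.xml"))):
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError:
            continue  # assumption: silently skip files that are not well-formed XML
        for elem in root.iter():
            text = (elem.text or "").strip()
            # Assumption: matching is case-insensitive.
            if word.lower() in text.lower():
                hits.append((os.path.basename(path), elem.tag, text))
    return hits


if __name__ == "__main__":
    for filename, tag, text in find_word("Harry Potter"):
        print(f"{filename}: <{tag}> {text}")
```

The returned tag name tells you which element held the match, even though the caller never had to know the element names up front.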

score -1 · Accepted Answer

Just strip the tags.

>>> import re
>>> txt = """<bookstore>
...     <book category="COOKING">
...         <title lang="english">Everyday Italian</title>
...         <author>Giada De Laurentiis</author>
...         <year>2005</year>
...         <price>300.00</price>
...     </book>
...
...     <book category="CHILDREN">
...         <title lang="english">Harry Potter</title>
...         <author>J K. Rowling </author>
...         <year>2005</year>
...         <price>625.00</price>
...     </book>
... </bookstore>"""
>>> exp = re.compile(r'<.*?>')
>>> text_only = exp.sub('',txt).strip()
>>> text_only
'Everyday Italian\n        Giada De Laurentiis\n        2005\n        300.00\n    \n\n    \n        Harry Potter\n        J K. Rowling \n        2005\n        625.00'

However, if you only need to search files for text on Linux, you can use grep:

burhan@sandbox:~$ grep "Harry Potter" file.xml
        <title lang="english">Harry Potter</title>

To search a file, either use the grep command above, or open the file and search it in Python:

>>> import re
>>> exp = re.compile(r'<.*?>')
>>> with open('file.xml') as f:
...     text_only = exp.sub('', f.read()).strip()
...
>>> if 'Harry Potter' in text_only:
...     print('It exists')
... else:
...     print('It does not')
...
It exists
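
One caveat with stripping tags by regular expression: comments, entities, and CDATA sections are not handled the way an XML parser handles them. The sketch below compares the two approaches on a made-up sample; SAMPLE, regex_text, and parsed_text are hypothetical names introduced only for this comparison.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample containing a comment, an entity, and a CDATA section.
SAMPLE = """<book>
    <!-- <title>Old draft title</title> -->
    <title>Salt &amp; Pepper</title>
    <note><![CDATA[5 < 6]]></note>
</book>"""

TAG = re.compile(r"<.*?>")


def regex_text(xml_text):
    """Strip anything that looks like a tag, then collapse whitespace."""
    return " ".join(TAG.sub("", xml_text).split())


def parsed_text(xml_text):
    """Let the parser resolve comments, entities, and CDATA, then collapse whitespace."""
    root = ET.fromstring(xml_text)
    return " ".join("".join(root.itertext()).split())


print(regex_text(SAMPLE))   # Old draft title --> Salt &amp; Pepper
print(parsed_text(SAMPLE))  # Salt & Pepper 5 < 6
```

With the regex, commented-out text leaks into the result, the entity stays encoded, and the CDATA content is lost, so a search for "Salt & Pepper" would fail. For plain documents like the bookstore sample both approaches give the same words.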
