python - Using lxml.etree to find text in namespaced XML elements

Question

I am trying to use lxml.etree to parse an XML file and find text in the elements of the XML.

The XML file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" 
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
     http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
 <responseDate>2002-06-01T19:20:30Z</responseDate> 
 <request verb="ListRecords" from="1998-01-15"
      set="physics:hep"
      metadataPrefix="oai_rfc1807">
      http://an.oa.org/OAI-script</request>
 <ListRecords>
  <record>
    <header>
      <identifier>oai:arXiv.org:hep-th/9901001</identifier>
      <datestamp>1999-12-25</datestamp>
      <setSpec>physics:hep</setSpec>
      <setSpec>math</setSpec>
    </header>
    <metadata>
     <rfc1807 xmlns=
    "http://info.internet.isi.edu:80/in-notes/rfc/files/rfc1807.txt" 
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
      xsi:schemaLocation=
       "http://info.internet.isi.edu:80/in-notes/rfc/files/rfc1807.txt
    http://www.openarchives.org/OAI/1.1/rfc1807.xsd">
    <bib-version>v2</bib-version>
    <id>hep-th/9901001</id>
    <entry>January 1, 1999</entry>
    <title>Investigations of Radioactivity</title>
    <author>Ernest Rutherford</author>
    <date>March 30, 1999</date>
     </rfc1807>
    </metadata>
    <about>
      <oai_dc:dc 
      xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
      xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ 
      http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
    <dc:publisher>Los Alamos arXiv</dc:publisher>
    <dc:rights>Metadata may be used without restrictions as long as 
       the oai identifier remains attached to it.</dc:rights>
      </oai_dc:dc>
    </about>
  </record>
  <record>
    <header status="deleted">
      <identifier>oai:arXiv.org:hep-th/9901007</identifier>
      <datestamp>1999-12-21</datestamp>
    </header>
  </record>
 </ListRecords>
</OAI-PMH>

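In what follows I assume doc = etree.parse("/tmp/test.xml"), where test.xml contains the XML pasted above.

First, I try to find all the <record> elements using doc.findall(".//record"), but it returns an empty list.

Second, for a given word, I would like to check whether it appears in <dc:publisher>. To do this I first tried the same approach as before, doc.findall(".//publisher"), but I hit the same problem. I am fairly sure all of this is related to namespaces, but I don't know how to handle them.

I have read the libxml tutorial and tried the findall example on a basic XML file (without any namespace), and it worked.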

score 6 · Accepted Answer

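As Chris has already mentioned, you can also use lxml and XPath. XPath does not let you write namespaced names in full, like {http://www.openarchives.org/OAI/2.0/}record (the so-called "James Clark notation" *), so you have to use prefixes and give the XPath engine a mapping from prefix to namespace URI.

Example with lxml (assuming you already have the desired tree object):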

nsmap = {'oa':'http://www.openarchives.org/OAI/2.0/', 
         'dc':'http://purl.org/dc/elements/1.1/'}
tree.xpath('//oa:record[descendant::dc:publisher[contains(., "Alamos")]]',
            namespaces=nsmap)

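This selects all {http://www.openarchives.org/OAI/2.0/}record elements that have a descendant {http://purl.org/dc/elements/1.1/}publisher element containing the word "Alamos".

[*] The name comes from an article in which James Clark explains XML namespaces. Anyone not familiar with namespaces should read it, even though it was written a long time ago.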
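For readers without lxml, the same selection can be approximated with the standard library. The sketch below is not part of the answer above: it assumes a trimmed-down copy of the question's XML, and the helper name records_with_publisher is hypothetical. The standard library's limited XPath subset has no contains(), so the text test is done in Python instead.

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the question's document (assumption: same namespaces).
XML = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
 <ListRecords>
  <record>
   <header><identifier>oai:arXiv.org:hep-th/9901001</identifier></header>
   <about>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
     <dc:publisher>Los Alamos arXiv</dc:publisher>
    </oai_dc:dc>
   </about>
  </record>
  <record>
   <header status="deleted"><identifier>oai:arXiv.org:hep-th/9901007</identifier></header>
  </record>
 </ListRecords>
</OAI-PMH>"""

# Prefix-to-URI mapping; the prefixes are local to this code and need not
# match the prefixes used inside the document.
NSMAP = {"oa": "http://www.openarchives.org/OAI/2.0/",
         "dc": "http://purl.org/dc/elements/1.1/"}


def records_with_publisher(root, word):
    """Return oa:record elements that have a dc:publisher whose text contains word."""
    matches = []
    for record in root.findall(".//oa:record", NSMAP):
        publishers = record.findall(".//dc:publisher", NSMAP)
        # Only the element's own text is checked, not the text of nested children.
        if any(word in (p.text or "") for p in publishers):
            matches.append(record)
    return matches


root = ET.fromstring(XML)
found = records_with_publisher(root, "Alamos")
identifiers = [r.findtext("oa:header/oa:identifier", namespaces=NSMAP) for r in found]
print(identifiers)
```

Only the first record is returned here, because the deleted record has no dc:publisher element.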

score 4 · Accepted Answer

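Disclaimer: I am using the standard library xml.etree.ElementTree module, not the lxml library (although, as far as I know, it is a subset of lxml). I am sure there is a much simpler answer than mine that uses lxml and XPath, but I don't know it.

Namespace issue

You were right that the problem is likely in the namespaces. There is no record element in your XML file, but there are two {http://www.openarchives.org/OAI/2.0/}record tags in it, as the following shows: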

>>> import xml.etree.ElementTree as etree

>>> xml_string = ...Your XML to parse...
>>> e = etree.fromstring(xml_string)

# Let's see what the root element is
>>> e
<Element {http://www.openarchives.org/OAI/2.0/}OAI-PMH at 7f39ebf54f80>

# Let's see what children there are of the root element
>>> for child in e:
...     print(child)
...
<Element {http://www.openarchives.org/OAI/2.0/}responseDate at 7f39ebf54fc8>
<Element {http://www.openarchives.org/OAI/2.0/}request at 7f39ebf58050>
<Element {http://www.openarchives.org/OAI/2.0/}ListRecords at 7f39ebf58098>

# Finally, let's get the children of the `ListRecords` element
>>> for child in e[-1]:
...     print(child)
... 
<Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf580e0>
<Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf58908>

So, for example,

>>> e.find('ListRecords')

returns None, whereas

>>> e.find('{http://www.openarchives.org/OAI/2.0/}ListRecords')
<Element {http://www.openarchives.org/OAI/2.0/}ListRecords at 7f39ebf58098>

returns the ListRecords element.

Note that I am using the find method because the standard library ElementTree does not have an xpath method.
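The difference between the unqualified and the qualified lookup can be reproduced in a few lines. This is a minimal sketch on a reduced document (an assumption, not the full file from the question). It also shows that find and findall in the standard library accept a prefix mapping as their second argument, which avoids spelling out the URI in every path.

```python
import xml.etree.ElementTree as ET

OAI = "http://www.openarchives.org/OAI/2.0/"

# Reduced stand-in for the question's file: one default namespace, two records.
root = ET.fromstring(
    '<OAI-PMH xmlns="%s"><ListRecords><record/><record/></ListRecords></OAI-PMH>' % OAI
)

# A bare tag name does not match, because every element lives in the OAI namespace.
unqualified = root.find("ListRecords")

# The fully qualified "{uri}local" form matches.
qualified = root.find("{%s}ListRecords" % OAI)

# A prefix mapping works as well; "oa" is an arbitrary prefix chosen here.
prefixed = root.findall(".//oa:record", {"oa": OAI})

print(unqualified, qualified.tag, len(prefixed))
```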

Possible solution

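One way to solve this is to get the namespace prefix and prepend it to the tag you are trying to find. You can use

>>> e.tag[:e.tag.index('}')+1]
'{http://www.openarchives.org/OAI/2.0/}'

on the root element, e, to find the namespace, although I am sure there is a better way of doing this.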
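The slicing expression raises ValueError when the tag has no namespace, because index('}') finds nothing. One possible alternative, sketched here with a hypothetical helper name namespace_prefix, checks for the leading brace first and returns an empty string for un-namespaced elements.

```python
import xml.etree.ElementTree as ET


def namespace_prefix(element):
    """Return '{uri}' for a namespaced element, or '' if it has no namespace."""
    tag = element.tag
    if tag.startswith("{"):
        # Everything between the leading "{" and the first "}" is the URI.
        uri, _, _ = tag[1:].partition("}")
        return "{%s}" % uri
    return ""


namespaced = ET.fromstring('<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"/>')
plain = ET.fromstring("<OAI-PMH/>")

print(repr(namespace_prefix(namespaced)))
print(repr(namespace_prefix(plain)))
```

The returned string can be prepended to a tag name directly, and prepending the empty string leaves the tag unchanged for documents without namespaces.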

Now we can define functions that find the tags we want, with an optional namespace prefix:

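def findallNS(element, tag, namespace=None):
    # namespace is the '{uri}' prefix, e.g. '{http://www.openarchives.org/OAI/2.0/}'
    if namespace is not None:
        return element.findall(namespace + tag)
    else:
        return element.findall(tag)

def findNS(element, tag, namespace=None):
    # Same as findallNS, but returns only the first match (or None)
    if namespace is not None:
        return element.find(namespace + tag)
    else:
        return element.find(tag)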

So now we can write:

>>> namespace = e.tag[:e.tag.index('}')+1]
>>> list_records = findNS(e, 'ListRecords', namespace)
>>> findallNS(list_records, 'record', namespace)
[<Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf580e0>, 
<Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf58908>]

Alternative solution

Another solution is to write a function that searches for all tags ending with the tag you are interested in, for example:

def find_child_tags(element, tag):
    return [child for child in element if child.tag.endswith(tag)]

Here you don't need to deal with the namespace at all.
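One caveat with a suffix test is that it also matches longer names that happen to end the same way. The sketch below uses made-up namespaces (urn:example:a, urn:example:b) and hypothetical helper names to compare the suffix test with an exact comparison of the local name, which is the part after the closing brace.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a leading '{uri}' from a tag, leaving the local name."""
    return tag.rsplit("}", 1)[-1]


def find_children_by_local_name(element, name):
    """Return direct children whose local name equals name, in any namespace."""
    return [child for child in element if local_name(child.tag) == name]


# Illustrative document: two 'record' elements in different namespaces
# and one 'subrecord' element that merely ends with 'record'.
root = ET.fromstring(
    '<list xmlns="urn:example:a" xmlns:b="urn:example:b">'
    "<record/><b:record/><subrecord/></list>"
)

loose = [child for child in root if child.tag.endswith("record")]
strict = find_children_by_local_name(root, "record")

print(len(loose), len(strict))
```

The suffix test picks up subrecord as well, while the exact comparison returns only the two record elements.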

score 2 · Accepted Answer

@Chris's answer is very good, and it works with lxml too. Here is another way using lxml (it works the same way with xpath instead of find):

In [37]: xml.find('.//n:record', namespaces={'n': 'http://www.openarchives.org/OAI/2.0/'})
Out[37]: <Element {http://www.openarchives.org/OAI/2.0/}record at 0x2a451e0>
