python - How to parse XML and find the text value of the next node in Python?

Question

Suppose I have the following sample configuration XML file.

<?xml version="1.0"?>
<note> 
    <to>Tove</to> 
    <infoaboutauthor>
      <nestedprofile>
           <aboutme> 
               <gco:CharacterString>I am a 10th grader who likes to play ball.</gco:CharacterString> 
          </aboutme>
      </nestedprofile>
    </infoaboutauthor>
    <date>
        <info_date>
            <date>
               <gco:Date>2003-06-13</gco:Date>
            </date>
            <datetype>
                <datetype attribute="Value">
                </datetype>
            </datetype>
        </info_date>
    </date>
    <from>Jani</from> 
    <heading>Reminder</heading> 
    <body>Don't forget me this weekend!</body> 
  </note>

In Python (I tried ElementTree, but I'm not sure whether it is the best choice), I'd like to get specific values from specific tags. Here is what I tried:

from xml.etree import ElementTree

with open('testfile.xml', 'rt') as f:
    tree = ElementTree.parse(f)
print('Parsing')
root = tree.getroot()
listofelements = root.findall('gco:CharacterString')
for elementfound in listofelements:
    print(elementfound.text)

The code above does not seem to work when the tag name contains a colon, because it fails with the following error.

SyntaxError: prefix 'gco' not found in prefix map

My goals are to:

Get the text in the "2003-06-13" tag
Get the text in the "aboutme" tag

What is the best way to accomplish this? Is there a way to look up "gco:CharacterString" where the parent is "aboutme"? Or is there some convenient way to put it into a dict so that I can write mydict['note']['to']['nestedprofile']['aboutme']?

Note: the "gco:" prefix is part of the XML and is something I have to deal with. If elementtree isn't suited to this, that's fine.

score 1 · Accepted Answer

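First, your XML is broken. The - on line 2 breaks the parser. I also don't think the parser likes the gco: prefixes. Can you use a different XML configuration? Or is this generated automatically by something you have no control over?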

For this to work in Python, the XML needs to look like this:

<?xml version="1.0"?>
<note>
    <to>Tove</to>
    <infoaboutauthor>
      <nestedprofile>
           <aboutme>
               <CharacterString>I am a 10th grader who likes to play ball.</CharacterString>
          </aboutme>
      </nestedprofile>
    </infoaboutauthor>
    <date>
        <info_date>
            <date>
               <Date>2003-06-13</Date>
            </date>
            <datetype>
                <datetype attribute="Value">
                </datetype>
            </datetype>
        </info_date>
    </date>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
  </note>

Here is the code to accomplish your two goals:

from xml.etree import ElementTree

# Parse the element tree from the file name rather than a file object
tree = ElementTree.parse('config.xml')

# Get the root of the tree
root = tree.getroot()

# Get the 'Date' tag and print its text
date_tag = root.find('date').find('info_date').find('date').find('Date')
print(date_tag.text)

# Get the tag inside `aboutme` and print its text
about_me_tag = root.find('infoaboutauthor').find('nestedprofile').find('aboutme').find('CharacterString')
print(about_me_tag.text)
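
The chained find calls can probably be shortened with ElementTree's path syntax. This is only a sketch: it inlines a trimmed copy of the cleaned-up XML (an assumption made so the snippet is self-contained) and uses findtext, which returns the text directly. The second query also shows one way to match CharacterString only when its parent is aboutme.

```python
from xml.etree import ElementTree

# Trimmed copy of the cleaned-up XML (prefixes already removed), inlined for illustration.
XML = """<?xml version="1.0"?>
<note>
    <infoaboutauthor>
      <nestedprofile>
           <aboutme>
               <CharacterString>I am a 10th grader who likes to play ball.</CharacterString>
          </aboutme>
      </nestedprofile>
    </infoaboutauthor>
    <date>
        <info_date>
            <date>
               <Date>2003-06-13</Date>
            </date>
        </info_date>
    </date>
</note>"""

root = ElementTree.fromstring(XML)

# A slash-separated path walks child elements starting from the root.
date_text = root.findtext('date/info_date/date/Date')

# './/aboutme/CharacterString' matches a CharacterString whose parent is
# an aboutme element at any depth below the root.
about_me_text = root.findtext('.//aboutme/CharacterString')

print(date_text)
print(about_me_text)
```

findtext returns None when nothing matches, so a typo in the path shows up as None rather than as an exception.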

Update

As far as dealing with the "gco:" goes, you could do something like this:

def replace_in_config(old, new):
    with open('config.xml', 'r') as f:
        text = f.read()

    with open('config.xml', 'w') as f:
        f.write(text.replace(old, new))

Then, before performing the XML operations above, run:

replace_in_config('gco:', '_stripped')

Then, once the XML operations are complete (of course, you'll need to account for the fact that the gco:Date tag is now _strippedDate, and likewise for the CharacterString tag), run this:

replace_in_config('_stripped', 'gco:')

This preserves the original format while still letting you parse the file with etree.
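
If rewriting the file on disk twice feels risky, the same idea can be applied in memory only. The following is a rough sketch, not a tested drop-in: parse_without_prefix is a hypothetical helper name, the sample string is a trimmed stand-in for the real file, and the plain str.replace would also alter any "gco:" that happens to appear inside text content.

```python
from xml.etree import ElementTree


def parse_without_prefix(xml_text, prefix='gco:'):
    """Parse XML text after removing an undeclared prefix (hypothetical helper).

    The file on disk is never modified; only the in-memory string changes.
    """
    return ElementTree.fromstring(xml_text.replace(prefix, ''))


# Trimmed stand-in for the real file contents, inlined for illustration.
sample = """<note>
  <aboutme>
    <gco:CharacterString>I am a 10th grader who likes to play ball.</gco:CharacterString>
  </aboutme>
  <date><gco:Date>2003-06-13</gco:Date></date>
</note>"""

root = parse_without_prefix(sample)

# With the prefix gone, ordinary ElementTree paths work.
about_me_text = root.findtext('.//aboutme/CharacterString')
date_text = root.findtext('.//date/Date')

print(about_me_text)
print(date_text)
```

For a real file, the text would come from reading it with open(...).read() before calling the helper.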

score 0 · Accepted Answer

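I don't think the XML document is valid, because the "gco" namespace is not defined.

I can't find a way to supply the definition to lxml as part of the parse command. You could manipulate the document to add the definition or remove the prefix, as suggested by @mjgpy3.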
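
Adding the definition could look roughly like the sketch below, which uses the standard library's ElementTree rather than lxml. Everything named here is an assumption for illustration: the URI is a placeholder (use whatever namespace the producer of the file actually intends), parse_with_declared_prefix is a hypothetical helper, and it assumes the root start tag is written exactly as <note> with no attributes.

```python
from xml.etree import ElementTree

# Placeholder URI, assumed for illustration only.
GCO_URI = 'http://example.com/gco'
NAMESPACES = {'gco': GCO_URI}


def parse_with_declared_prefix(xml_text, root_tag='note'):
    """Declare the gco prefix on the root start tag in memory, then parse (hypothetical helper)."""
    old = '<%s>' % root_tag
    new = '<%s xmlns:gco="%s">' % (root_tag, GCO_URI)
    # Replace only the first occurrence, which is the root start tag.
    return ElementTree.fromstring(xml_text.replace(old, new, 1))


# Trimmed stand-in for the real file contents, inlined for illustration.
sample = """<note>
  <aboutme>
    <gco:CharacterString>I am a 10th grader who likes to play ball.</gco:CharacterString>
  </aboutme>
  <date><gco:Date>2003-06-13</gco:Date></date>
</note>"""

root = parse_with_declared_prefix(sample)

# Passing the prefix map lets the queries keep using the gco: prefix.
about_me_text = root.findtext('.//aboutme/gco:CharacterString', namespaces=NAMESPACES)
date_text = root.findtext('.//date/gco:Date', namespaces=NAMESPACES)

print(about_me_text)
print(date_text)
```

This keeps the prefixed tag names intact, so nothing has to be renamed back afterwards.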

Another approach is to use the HTML parser, which is less strict about what it accepts. Be aware, however, that this changes the structure of the data and adds HTML headers and the like.

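from lxml import etree

# The HTML parser is lenient about input it would otherwise reject
parser = etree.HTMLParser()
xml_doc = etree.parse('C:/Temp/Test.xml', parser)

# Find every characterstring element anywhere in the document
elements = xml_doc.xpath('//characterstring')

for element in elements:
    print(element.text)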
