python - Python: Extracting information from XML into a dictionary

Question

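I need to extract information from an XML file, separate it from the surrounding XML tags, store it in a dictionary, and then loop over the dictionary to print a list. I'm a complete beginner, so I'd like to keep this as simple as possible. I apologize if the way I've described what I want doesn't make much sense.

This is what I have so far.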

for line in open("/people.xml"):
    if "name" in line:
        print(line)
    if "age" in line:
        print(line)

Current output:

     <name>John</name>

  <age>14</age>

    <name>Kevin</name>

  <age>10</age>

    <name>Billy</name>

  <age>12</age>

Desired output:

Name          Age
John          14
Kevin         10
Billy         12

Edit: using the code from the answers below, I can get this output:

{'Billy': '12', 'John': '14', 'Kevin': '10'}

Does anyone know how to get from this to a table with headers, like the desired output?

score 3 · Accepted Answer

Try xmldict (it converts XML to a Python dictionary and vice versa):

>>> xmldict.xml_to_dict('''
... <root>
...   <persons>
...     <person>
...       <name first="foo" last="bar" />
...     </person>
...     <person>
...       <name first="baz" last="bar" />
...     </person>
...   </persons>
... </root>
... ''')
{'root': {'persons': {'person': [{'name': {'last': 'bar', 'first': 'foo'}}, {'name': {'last': 'bar', 'first': 'baz'}}]}}}


# Converting dictionary to xml 
>>> xmldict.dict_to_xml({'root': {'persons': {'person': [{'name': {'last': 'bar', 'first': 'foo'}}, {'name': {'last': 'bar', 'first': 'baz'}}]}}})
'<root><persons><person><name><last>bar</last><first>foo</first></name></person><person><name><last>bar</last><first>baz</first></name></person></persons></root>'

Or try xmlmapper (it gives a list of Python dictionaries with parent-child relationships):

  >>> myxml='''<?xml version='1.0' encoding='us-ascii'?>
          <slideshow title="Sample Slide Show" date="2012-12-31" author="Yours Truly" >
          <slide type="all">
              <title>Overview</title>
              <item>Why
                  <em>WonderWidgets</em>
                     are great
                  </item>
                  <item/>
                  <item>Who
                  <em>buys</em>
                  WonderWidgets1
              </item>
          </slide>
          </slideshow>'''
  >>> x=xml_to_dict(myxml)
  >>> for s in x:
          print s
  >>>
  {'text': '', 'tail': None, 'tag': 'slideshow', 'xmlinfo': {'ownid': 1, 'parentid': 0}, 'xmlattb': {'date': '2012-12-31', 'author': 'Yours Truly', 'title': 'Sample Slide Show'}}
  {'text': '', 'tail': '', 'tag': 'slide', 'xmlinfo': {'ownid': 2, 'parentid': 1}, 'xmlattb': {'type': 'all'}}
  {'text': 'Overview', 'tail': '', 'tag': 'title', 'xmlinfo': {'ownid': 3, 'parentid': 2}, 'xmlattb': {}}
  {'text': 'Why', 'tail': '', 'tag': 'item', 'xmlinfo': {'ownid': 4, 'parentid': 2}, 'xmlattb': {}}
  {'text': 'WonderWidgets', 'tail': 'are great', 'tag': 'em', 'xmlinfo': {'ownid': 5, 'parentid': 4}, 'xmlattb': {}}
  {'text': None, 'tail': '', 'tag': 'item', 'xmlinfo': {'ownid': 6, 'parentid': 2}, 'xmlattb': {}}
  {'text': 'Who', 'tail': '', 'tag': 'item', 'xmlinfo': {'ownid': 7, 'parentid': 2}, 'xmlattb': {}}
  {'text': 'buys', 'tail': 'WonderWidgets1', 'tag': 'em', 'xmlinfo': {'ownid': 8, 'parentid': 7}, 'xmlattb': {}}

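The code above gives you a generator. When you iterate over it, you get the information in dict keys such as tag, text, xmlattb, and tail, plus additional information in xmlinfo. Here the root element has a parentid of 0.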
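To make the "flat list of dictionaries with parent links" idea concrete, here is a minimal sketch built on the standard library's xml.etree.ElementTree. It is not xmlmapper's implementation; the function name flatten, the key names (attrib, own_id, parent_id), and the sample XML are all made up for illustration.

```python
import xml.etree.ElementTree as ET


def flatten(xml_text):
    """Yield one dict per element, in document order, with a link to its parent."""
    root = ET.fromstring(xml_text)
    next_id = 1
    # Stack of (element, parent_id); the root is given parent id 0.
    stack = [(root, 0)]
    while stack:
        elem, parent_id = stack.pop()
        own_id = next_id
        next_id += 1
        yield {
            "tag": elem.tag,
            "text": (elem.text or "").strip(),
            "attrib": dict(elem.attrib),
            "own_id": own_id,
            "parent_id": parent_id,
        }
        # Push children in reverse so they are popped in document order.
        for child in reversed(list(elem)):
            stack.append((child, own_id))


# Hypothetical sample; the layout of the asker's file is an assumption.
SAMPLE = "<people><person><name>John</name><age>14</age></person></people>"
rows = list(flatten(SAMPLE))
for row in rows:
    print(row)
```

Each row can be matched to its parent through parent_id, which is the same kind of relationship the xmlinfo entries above describe.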

score 1 · Accepted Answer

I would use an XML parser for this. For example:

import xml.etree.ElementTree as ET
doc = ET.parse('people.xml')
names = [name.text for name in doc.findall('.//name')]
ages = [age.text for age in doc.findall('.//age')]
people = dict(zip(names,ages))
print(people)
# {'Billy': '12', 'John': '14', 'Kevin': '10'}
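To get from a dictionary like that to the table with headers that the question asks for, something along these lines might work. This is only a sketch: format_table is a hypothetical helper, the column padding of 4 is arbitrary, and the sample dictionary is typed in by hand rather than read from people.xml.

```python
def format_table(people):
    """Return the lines of a two-column table built from a name -> age dict."""
    # First column width: the longest name (or the header) plus some padding.
    width = max(len(name) for name in list(people) + ["Name"]) + 4
    lines = ["Name".ljust(width) + "Age"]
    for name, age in people.items():
        lines.append(name.ljust(width) + str(age))
    return lines


# Sample data copied from the question (assumed to match people.xml).
people = {"John": "14", "Kevin": "10", "Billy": "12"}
print("\n".join(format_table(people)))
```

str.ljust pads each name to the same width, so the ages line up under the "Age" header regardless of name length.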

score 0 · Accepted Answer

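This looks like an exercise in learning how to parse this XML by hand, rather than simply pulling a library out of the bag to do it for you. If I'm wrong, I'd recommend watching the Udacity video by Steve Huffman, in which he explains how to use the minidom module to parse lightweight XML files like these.

Now, the first thing I want to say in my answer is that you don't want to build a Python dictionary just to print all of these values. A Python dictionary is a set of keys that map to values. Before Python 3.7, dictionaries had no guaranteed ordering, so traversing one in the order the entries appeared in the file was a hassle. Since you're trying to print every name together with its corresponding age, a data structure like a list of tuples is probably better suited to collating your data.

Your XML file appears to be structured so that each name tag is followed by its corresponding age tag. There also appears to be only one name tag per line. That makes the problem considerably simpler. I'm not going to write the most efficient or most general solution to this problem; instead, I'll try to make the code as easy to understand as possible.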
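For reference, a minidom-based version might look roughly like the sketch below. The sample XML is a guess at the file's layout (the question never shows the whole file), and element_texts is a hypothetical helper, not something from the video.

```python
from xml.dom import minidom

# Hypothetical sample; the real layout of people.xml is not shown in the question.
SAMPLE = """<people>
  <person><name>John</name><age>14</age></person>
  <person><name>Kevin</name><age>10</age></person>
</people>"""


def element_texts(doc, tag):
    """Return the text of every <tag> element, in document order.

    Assumes each element holds plain text only; an empty element
    (no child node) would need an extra check.
    """
    return [node.firstChild.data for node in doc.getElementsByTagName(tag)]


doc = minidom.parseString(SAMPLE)
names = element_texts(doc, "name")
ages = element_texts(doc, "age")
print(list(zip(names, ages)))
```

For a file on disk, minidom.parse(path) takes the place of minidom.parseString.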
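A small illustration of the difference, using made-up data with a repeated name (the sample file in the question has no duplicates, so this is purely hypothetical): a dictionary keyed by name keeps only one entry per name, while a list of tuples keeps every record in the order it was read.

```python
# Made-up data: two different people who happen to share a name.
names = ["John", "Kevin", "John"]
ages = ["14", "10", "12"]

as_dict = dict(zip(names, ages))   # the later "John" overwrites the earlier one
as_pairs = list(zip(names, ages))  # every record is kept, in file order

print(as_dict)
print(as_pairs)
```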

So, first, let's create a list to store our data: a_list = []

Next, open the file and initialize a couple of variables to hold each name and age.

with open("/people.xml") as f:
    name, age = None, None  # hold the current name and age during the traversal
    for line in f:
        name = extract_name(line, name)  # this function is defined later
        age = extract_age(line)  # so is this one
        if age:  # once an age is found, the person is complete
            a_list.append((name, age))  # store the pair...
            name, age = None, None  # ...and reset the variables for the next person
        # otherwise, simply read the next line until an age is found

For each line of the file, we want to determine whether it contains a name and, if it does, extract it. Let's write the function that does this.

def extract_name(a_line, name):  # takes the line plus the name value held during the traversal
    if name:  # a name is already held: keep it (it is cleared once the matching age is found)
        return name
    if "<name>" not in a_line:  # no name on this line, so there is nothing to extract
        return None
    name_pos = a_line.find("<name>") + 6  # skip past "<name>", which is 6 characters long
    end_pos = a_line.find("</name>")
    return a_line[name_pos:end_pos]

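Now we need a function that parses a line for the user's age. It can be written much like the previous one, except that we know an age is appended to the list as soon as we find it, so we don't need to care about any previous age value. The function therefore looks like this: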

def extract_age(a_line):
    if "<age>" not in a_line:  # no age on this line
        return None
    age_pos = a_line.find("<age>") + 5  # skip past "<age>", which is 5 characters long
    end_pos = a_line.find("</age>")
    return a_line[age_pos:end_pos]

Finally, print the list. You can do that like this:

for item in a_list:
    print('\t'.join(item))

I hope this helps. I haven't tested my code, so it may still be a little buggy, but the concept is there. :)
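For anyone who wants to run the whole idea end to end, here is one way the pieces might be put together. It is a sketch under assumptions: extract_tag and parse_people are hypothetical names that fold the two extract functions into one helper, the sample lines are a guess at the file's layout, and the column width of 14 is picked to resemble the desired output.

```python
def extract_tag(line, tag):
    """Return the text between <tag> and </tag> on this line, or None."""
    open_tag, close_tag = "<%s>" % tag, "</%s>" % tag
    start = line.find(open_tag)
    end = line.find(close_tag)
    if start == -1 or end == -1:
        return None
    return line[start + len(open_tag):end]


def parse_people(lines):
    """Collect (name, age) tuples, assuming each name precedes its age."""
    people, name = [], None
    for line in lines:
        if name is None:  # keep a held name until its age turns up
            name = extract_tag(line, "name")
        age = extract_tag(line, "age")
        if age is not None:  # an age completes the current person
            people.append((name, age))
            name = None
    return people


# Hypothetical sample; in practice, pass an open file object instead.
SAMPLE = """<people>
  <person>
    <name>John</name>
    <age>14</age>
  </person>
  <person>
    <name>Kevin</name>
    <age>10</age>
  </person>
  <person>
    <name>Billy</name>
    <age>12</age>
  </person>
</people>"""

people = parse_people(SAMPLE.splitlines())
for name, age in [("Name", "Age")] + people:
    print(name.ljust(14) + age)
```

Like the hand-written version above, this only works when each tag opens and closes on a single line; a real XML parser is the safer choice for anything less regular.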
