python - Pythonを使用してHadoopでxmlファイルを処理する

Question

私はxmlファイルを処理するためにhadoopでpythonを使用しています、私は以下の形式のxmlファイルを持っていました

temporary.xml

<report>
<report-name name="ALL_TIME_KEYWORDS_PERFORMANCE_REPORT"/>
<date-range date="All Time"/>
  <table>
    <columns>
       <column name="campaignID" display="Campaign ID"/>
       <column name="adGroupID" display="Ad group ID"/>
       <column name="keywordID" display="Keyword ID"/>
       <column name="keyword" display="Keyword"/>
    </columns>
    <row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content"/>
    <row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content"/>
    <row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content"/>
    <row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content"/>
  </table>
</report>

今私がしたいのは、上記のxmlファイルを処理し、後でデータをMSSQLデータベースに保存することだけです。

mapper.pyコード

import sys
import cStringIO
import xml.etree.ElementTree as xml

if __name__ == '__main__':
    buff = None
    intext = False
    for line in sys.stdin:
        line = line.strip()
        if line.find("<row>") != -1:
            intext = True
            buff = cStringIO.StringIO()
            buff.write(line)
        elif line.find("</>") != -1:
            intext = False
            buff.write(line)
            val = buff.getvalue()
            buff.close()
            buff = None
            print val

ここで私がしたいのrow tagsは、そこからデータをフェッチし、それらの値を出力して、それcampaignID,adgroupID,keywordID,keywordを入力として出力することreducer.pyです（これは、データベースにデータを保存するためのコードで構成されています）。

私はいくつかの例を調べましたが、タグはのようなもの<tag> </tag>ですが、私の場合は<row/>

しかし、上記の私のコードは機能していません/何も印刷していません。誰かが私のコードを修正し、行タグから値/データを取得するために必要なPythonコードを追加してください（私はhadoopに非常に新しいです）。次回からのコード。

score 0 · Accepted Answer

xpathの使用を検討しましたか？これは、xmlツリーを回避するために使用できるミニ言語です。Python内から簡単に使用できます。

http://docs.python.org/2/library/xml.etree.elementtree.htmlが役立つかもしれません

ElementTreeでXPathを使用したヘルプの必要性も確認することをお勧めします

これが私がそれを行う方法です（これは有効なPythonコードです。Python3.2でテストしました。サンプルxmlで正常に動作します）：

import xml.etree.ElementTree as xml #you had this line in your code. I am not using any tool you  do not have access to in your script

def get_row_attributes(the_xml_as_a_string):
    """
    this function takes xml as a string. 
    It can work with xml that looks like your included example xml.
    This function returns a list of dictionaries. Each dictionary is made up of the attributes of each row. So the result looks like:
     [
          {attribute_name:value_for_first_row,attribute_name:value_for_first_row...},
          {attribute_name:value_for_second_row,attribute_name:value_for_second_row...},
          etc
     ]
    """
    tree = xml.fromstring(the_xml_as_a_string)
    rows = tree.findall('table/row')  # 'table/row' is xpath. it means get all the rows in all the tables
    return [row.attrib for row in rows]

この関数を使用するには、stdを読み込んで、文字列を作成します。電話get_row_attributes(the_xml_as_a_string)

結果の辞書には、要求した情報（行の属性）が含まれます。

だから今私たちは持っています

std-inからのものを読む
すべての行に関するすべての情報を取得しました

すべて完全に通常のPythonを使用しています

最後に行うことは、それを他のプロセスに書き込むことです。この部分についてサポートが必要な場合は、データの形式と保存先に関する情報を含めてください

python - Pythonを使用してHadoopでxmlファイルを処理する

1 に答える 1

Related

Reference