python - Parsing a huge XML file with Python's `etree.iterparse()` does not work correctly. Is there a logic error in the code?

Question

I want to parse a huge XML file. The file as a whole looks like this, with each `record_i` standing for one publication record:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
    record_1
    ...
    record_n
</dblp>

I wrote some code to select records from this file.

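When I run the code (it takes nearly 50 minutes, including storing the data in a MySQL database), I notice that there is a record which supposedly has nearly a million authors. That must be wrong. I checked the file to make sure the error is not there: the paper in question has only 5 or 6 authors, so dblp.xml is fine. I therefore assume there is a logic error in my code, but I cannot figure out where it is. Maybe someone can tell me where the error is?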

The code stops at the line `if len(auth) > 2000`.

import sys
import MySQLdb
from lxml import etree


elements = ['article', 'inproceedings', 'proceedings', 'book', 'incollection']
tags = ["author", "title", "booktitle", "year", "journal"]


def fast_iter(context, cursor):
    mydict = {} # represents a paper with all its tags.
    auth = [] # a list of authors who have written the paper "together".
    counter = 0 # counts the papers

    for event, elem in context:
        if elem.tag in elements and event == "start":
            mydict["element"] = elem.tag
            mydict["mdate"] = elem.get("mdate")
            mydict["key"] = elem.get("key")

        elif elem.tag == "title" and elem.text != None:
            mydict["title"] = elem.text
        elif elem.tag == "booktitle" and elem.text != None:
            mydict["booktitle"] = elem.text
        elif elem.tag == "year" and elem.text != None:
            mydict["year"] = elem.text
        elif elem.tag == "journal" and elem.text != None:
            mydict["journal"] = elem.text
        elif elem.tag == "author" and elem.text != None:
            auth.append(elem.text)
        elif event == "end" and elem.tag in elements:
            counter += 1
            print counter
            #populate_database(mydict, auth, cursor)
            mydict.clear()
            auth = []
            if mydict or auth:
                sys.exit("Program aborted because auth or mydict was not deleted properly!")
        if len(auth) > 200: # There are up to ~150 authors per paper. 
            sys.exit("auth: It seems there is a paper which has too many authors!")
        if len(mydict) > 50: # A paper can have much metadata.
            sys.exit("mydict: It seems there is a paper which has too many tags.")

        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context


def main():
        cursor = connectToDatabase()
        cursor.execute("""SET NAMES utf8""")

        context = etree.iterparse(PATH_TO_XML, dtd_validation=True, events=("start", "end"))
        fast_iter(context, cursor)

        cursor.close()


if __name__ == '__main__':
    main()

Edit:

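I was completely wrong when I wrote this function. I made a big mistake through an oversight. At one point in the file I skipped nearly a million records in a row, and the next wanted record was blown up as a result: the skipped record types are not in `elements`, so `auth` was never reset while their authors kept being appended.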

With the help of John and Paul I managed to rewrite my code. It is parsing right now and seems to be going well. If unexpected errors remain, I will report back. Otherwise, thank you all for your help. I really appreciate it!

def fast_iter2(context, cursor):
    elements = set([
        'article', 'inproceedings', 'proceedings', 'book', 'incollection',
        'phdthesis', "mastersthesis", "www"
        ])
    childElements = set(["title", "booktitle", "year", "journal", "ee"])

    paper = {} # represents a paper with all its tags.
    authors = []   # a list of authors who have written the paper "together".
    paperCounter = 0
    for event, element in context:
        tag = element.tag
        if tag in childElements:
            if element.text:
                paper[tag] = element.text
                # print tag, paper[tag]
        elif tag == "author":
            if element.text:
                authors.append(element.text)
                # print "AUTHOR:", authors[-1]
        elif tag in elements:
            paper["element"] = tag
            paper["mdate"] = element.get("mdate")
            paper["dblpkey"] = element.get("key")
            # print tag, element.get("mdate"), element.get("key"), event
            if paper["element"] in ['phdthesis', "mastersthesis", "www"]:
                pass
            else:
                populate_database(paper, authors, cursor)
            paperCounter += 1
            print paperCounter
            paper = {}
            authors = []
            # if paperCounter == 100:
            #     break
            element.clear()
            while element.getprevious() is not None:
                del element.getparent()[0]
    del context
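
To make the failure described in the edit concrete, here is a minimal sketch that reproduces it on a made-up three-record sample. It uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs without extra dependencies; the sample data and the helper name `authors_per_record` are assumptions for illustration, not part of the original code.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature of dblp.xml: two <www> records precede one <article>.
SAMPLE = b"""<dblp>
  <www key="w1"><author>A</author><title>Home Page</title></www>
  <www key="w2"><author>B</author><title>Home Page</title></www>
  <article key="a1"><author>C</author><title>Paper</title></article>
</dblp>"""

WANTED = {"article", "inproceedings", "proceedings", "book", "incollection"}


def authors_per_record(xml_bytes, record_tags):
    """Return {key: [authors]} for every record whose tag is in record_tags."""
    result = {}
    authors = []
    # No events argument, so only "end" events are delivered.
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes)):
        if elem.tag == "author" and elem.text:
            authors.append(elem.text)
        elif elem.tag in record_tags:
            result[elem.get("key")] = authors
            authors = []  # the reset happens only for listed record types
    return result


# <www> is not listed: its authors leak into the following article.
leaky = authors_per_record(SAMPLE, WANTED)
# <www> is listed: every record gets only its own authors.
fixed = authors_per_record(SAMPLE, WANTED | {"www"})
```

With the first call the article ends up with the authors of both preceding `www` records; listing every record type that occurs in the file, as `fast_iter2` does, keeps the lists separate.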

score 3 · Accepted Answer

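Please eliminate one source of confusion: you don't actually say that the code as shown trips on one of your "number of things > 2000" tests. If it does not, then the problem lies in the database update code (which you haven't shown us).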

If it does trip:

(1) Reduce the limits from 2000 to reasonable values (about 20 for `auth`, exactly 7 for `mydict`).

(2) When the trip happens, run `print repr(mydict); print; print repr(auth)` and compare the contents against your file.
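
The two steps above can be sketched as a small guard that dumps both containers the moment a limit is exceeded. The function name `check_limits` and the constants are hypothetical; the values follow the suggestion in (1).

```python
MAX_AUTHORS = 20  # "about 20", as suggested above
MAX_FIELDS = 7    # element, mdate, key, title, booktitle, year, journal


def check_limits(mydict, auth):
    """Raise with a full dump of both containers when either limit is exceeded."""
    if len(auth) > MAX_AUTHORS or len(mydict) > MAX_FIELDS:
        raise RuntimeError(
            "limit exceeded\nmydict = %r\nauth = %r" % (mydict, auth)
        )
```

Calling it once per loop iteration in place of the two `sys.exit()` checks would show exactly which record was being assembled when the limit was hit.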

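Aside: with iterparse(), elem.text is NOT guaranteed to have been parsed when the "start" event occurs. To save running time, you should access elem.text only when the "end" event occurs. In fact, there seems to be no reason why you need "start" events at all. You also define a list `tags` but never use it. The start of your function could be written much more concisely: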

def fast_iter(context, cursor):
    mydict = {} # represents a paper with all its tags.
    auth = [] # a list of authors who have written the paper "together".
    counter = 0 # counts the papers
    tagset1 = set(['article', 'inproceedings', 'proceedings', 'book', 'incollection'])
    tagset2 = set(["title", "booktitle", "year", "journal"])
    for event, elem in context:
        tag = elem.tag
        if tag in tagset2:
            if elem.text:
                mydict[tag] = elem.text
        elif tag == "author":
            if elem.text:
                auth.append(elem.text)
        elif tag in tagset1:
            counter += 1
            print counter
            mydict["element"] = tag
            mydict["mdate"] = elem.get("mdate")
            mydict["dblpkey"] = elem.get("key")
            #populate_database(mydict, auth, cursor)
            mydict.clear() # Why not just do mydict = {} ??
            auth = []
            # etc etc

Don't forget to fix the call to iterparse() to remove the events argument.
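
As a quick check of what removing the events argument does, the sketch below compares the two call styles on a tiny made-up document. It uses the standard library's `xml.etree.ElementTree` for self-containment; lxml's `iterparse()` likewise reports only "end" events by default.

```python
import io
import xml.etree.ElementTree as ET

xml_bytes = b"<dblp><article key='a1'><title>T</title></article></dblp>"

# No events argument: only "end" events are reported.
default_events = [
    (event, elem.tag) for event, elem in ET.iterparse(io.BytesIO(xml_bytes))
]

# Requesting both kinds doubles the number of loop iterations.
both_events = [
    (event, elem.tag)
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
]
```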

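Also, you should do elem.clear() only when the event is "end", and it only needs to be done when `tag in tagset1`. Read the relevant documentation carefully. Doing the cleanup in a "start" event is quite likely to damage your tree.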
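
Putting these points together, here is a minimal sketch of an "end"-only loop that clears an element only after its record is complete. It is an illustration under stated assumptions, not the definitive fix: it uses the standard library's `xml.etree.ElementTree` rather than lxml, the generator name `iter_records` is hypothetical, and the real dblp.xml relies on its DTD, which the standard-library parser may not handle the way lxml does with `dtd_validation=True`.

```python
import io
import xml.etree.ElementTree as ET

# Every record type that occurs in the file must be listed here; otherwise the
# children of unlisted records leak into the next listed record.
RECORD_TAGS = {
    "article", "inproceedings", "proceedings", "book", "incollection",
    "phdthesis", "mastersthesis", "www",
}
FIELD_TAGS = {"title", "booktitle", "year", "journal"}


def iter_records(source):
    """Yield (fields, authors) per record, clearing each record once complete."""
    fields, authors = {}, []
    for _event, elem in ET.iterparse(source):  # "end" events only
        tag = elem.tag
        if tag in FIELD_TAGS:
            if elem.text:
                fields[tag] = elem.text
        elif tag == "author":
            if elem.text:
                authors.append(elem.text)
        elif tag in RECORD_TAGS:
            fields["element"] = tag
            fields["mdate"] = elem.get("mdate")
            fields["dblpkey"] = elem.get("key")
            yield fields, authors
            fields, authors = {}, []
            # Safe here: the record and all of its children are fully parsed.
            elem.clear()
```

A caller can then filter by `fields["element"]` and hand the wanted records to the database code. With lxml, the preceding-sibling deletion loop from the question would go right after `elem.clear()`.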
