python - How do I parse an XML hierarchy in Python?

Question

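I'm new to Python and have been working on various projects to get up to speed. At the moment I'm working on a routine that reads the Code of Federal Regulations (CFR) and, for each paragraph, prints that paragraph's organizational hierarchy. For example, a simplified version of the CFR's XML scheme looks like this: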

<CHAPTER>
<HD SOURCE="HED">PART 229—NONDISCRIMINATION ON THE BASIS OF SEX IN EDUCATION PROGRAMS OR ACTIVITIES RECEIVING FEDERAL FINANCIAL ASSISTANCE</HD>
     <SECTION>
        <SECTNO>### 229.120</SECTNO>
        <SUBJECT>Transfers of property.</SUBJECT>
        <P>If a recipient sells or otherwise transfers property (…) subject to the provisions of ### 229.205 through 229.235(a).</P>
     </SECTION>

I'd like to be able to output this to a CSV so that I can run text analysis on it:

Title 22, Volume 2, Part 229, Section 229.120, If a recipient sells or otherwise transfers property (…) subject to the provisions of ### 229.205 through 229.235(a).

Note that I'm not taking the Title and Volume numbers from the XML, because they are actually included in the file name in a much more standardized format.
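As a minimal sketch of that file-name step: the pattern below is an assumption inferred from a name like CFR-2012-title22-vol1.xml, and `parse_title_volume` is a hypothetical helper, not part of the code in this post. It is written for Python 3.

```python
import re

# Assumed file-name pattern, inferred from names like "CFR-2012-title22-vol1.xml".
FILENAME_RE = re.compile(r"title(?P<title>\d+)-vol(?P<volume>\d+)\.xml$")


def parse_title_volume(file_name):
    """Return (title, volume) as strings, or (None, None) if the name does not match."""
    match = FILENAME_RE.search(file_name)
    if match is None:
        return None, None
    return match.group("title"), match.group("volume")


print(parse_title_volume("CFR-2012-title22-vol1.xml"))  # ('22', '1')
```

A regular expression keeps both numbers in one place and fails visibly (returning None values) when a file does not follow the expected naming.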

Because I'm such a Python novice, most of the code is based on the search-engine code from Udacity's computer science course. Here is the Python I've written/adapted so far:

import os
import urllib2
from xml.dom.minidom import parseString
file_path = '/Users/owner1/Downloads/CFR-2012/title-22/CFR-2012-title22-vol1.xml'
file_name = os.path.basename(file_path) #Gets the filename from the path.
doc = open(file_path)
page = doc.read()

def clean_title(file_name): #Gets the title number from the filename.
    start_title = file_name.find('title')
    end_title = file_name.find("-", start_title+1)
    title = file_name[start_title+5:end_title]
    return title

def clean_volume(file_name): #Gets the volume number from the filename.
    start_volume = file_name.find('vol')
    end_volume = file_name.find('.xml', start_volume)
    volume = file_name[start_volume+3:end_volume]
    return volume

def get_next_section(page): #Gets all of the text between <SECTION> tags.
    start_section = page.find('<SECTION')
    if start_section == -1:
        return None, 0
    start_text = page.find('>', start_section)
    end_quote = page.find('</SECTION>', start_text + 1)
    section = page[start_text + 1:end_quote]
    return section, end_quote

def get_section_number(section): #Within the <SECTION> tag, find the section number based on the <SECTNO> tag.
    start_section_number = section.find('<SECTNO>###')
    if start_section_number == -1:
        return None, 0
    end_section_number = section.find('</SECTNO>', start_section_number)
    section_number = section[start_section_number+11:end_section_number]
    return section_number, end_section_number

def get_paragraph(section): #Within the <SECTION> tag, finds <P> paragraphs.
    start_paragraph = section.find('<P>')
    if start_paragraph == -1:
        return None, 0
    end_paragraph = section.find('</P>', start_paragraph)
    paragraph = section[start_paragraph+3:end_paragraph]
    return start_paragraph, paragraph, end_paragraph


def print_all_paragraphs(page): #This is the section that I would *like* to have print each paragraph and the citation hierarchy.
    section, endpos = get_next_section(page)
    for pragraph in section:
        title = clean_title(file_name)
        volume = clean_volume(file_name)
        section, endpos = get_next_section(page)
        section_number, end_section_number = get_section_number(section)
        start_paragraph, paragraph, end_paragraph = get_paragraph(section)
        if paragraph:
            print "Title: "+ title + " Volume: "+ volume +" Section Number: "+ section_number + " Text: "+ paragraph
            page = page[end_paragraph:]
        else:
            break

print print_all_paragraphs(page)
doc.close()

At the moment, this code has the following problems (sample output is shown further down):

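It prints the first paragraph multiple times. How can I print each <P> tag with its own title number, volume number, and so on?
The CFR has empty sections that are "Reserved". These sections have no <P> tags, so the if test fails and the loop breaks. I tried implementing for/while loops, but for some reason, when I do that, the code just prints the first paragraph it finds over and over.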

Here is an example of the output:

Title: 22 Volume: 1 Section Number:  9.10 Text: All requests to the Department by a member of the public, a government employee, or an agency to declassify and release information shall result in a prompt declassification review of the information in accordance with procedures set forth in 22 CFR 171.20-25. Mandatory declassification review requests should be directed to the Information and Privacy Coordinator, U.S. Department of State, SA-2, 515 22nd St., NW., Washington, DC 20522-6001.
Title: 22 Volume: 1 Section Number:  9.10 Text: All requests to the Department by a member of the public, a government employee, or an agency to declassify and release information shall result in a prompt declassification review of the information in accordance with procedures set forth in 22 CFR 171.20-25. Mandatory declassification review requests should be directed to the Information and Privacy Coordinator, U.S. Department of State, SA-2, 515 22nd St., NW., Washington, DC 20522-6001.
Title: 22 Volume: 1 Section Number:  9.10 Text: All requests to the Department by a member of the public, a government employee, or an agency to declassify and release information shall result in a prompt declassification review of the information in accordance with procedures set forth in 22 CFR 171.20-25. Mandatory declassification review requests should be directed to the Information and Privacy Coordinator, U.S. Department of State, SA-2, 515 22nd St., NW., Washington, DC 20522-6001.
Title: 22 Volume: 1 Section Number:  9.11 Text: The Information and Privacy Coordinator shall be responsible for conducting a program for systematic declassification review of historically valuable records that were exempted from the automatic declassification provisions of section 3.3 of the Executive Order. The Information and Privacy Coordinator shall prioritize such review on the basis of researcher interest and the likelihood of declassification upon review.
Title: 22 Volume: 1 Section Number:  9.12 Text: For Department procedures regarding the access to classified information by historical researchers and certain former government personnel, see Sec. 171.24 of this Title.
Title: 22 Volume: 1 Section Number:  9.13 Text: Specific controls on the use, processing, storage, reproduction, and transmittal of classified information within the Department to provide protection for such information and to prevent access by unauthorized persons are contained in Volume 12 of the Department's Foreign Affairs Manual.
Title: 22 Volume: 1 Section Number:  9a.1 Text: These regulations implement Executive Order 11932 dated August 4, 1976 (41 FR 32691, August 5, 1976) entitled “Classification of Certain Information and Material Obtained from Advisory Bodies Created to Implement the International Energy Program.”
Title: 22 Volume: 1 Section Number:  9a.1 Text: These regulations implement Executive Order 11932 dated August 4, 1976 (41 FR 32691, August 5, 1976) entitled “Classification of Certain Information and Material Obtained from Advisory Bodies Created to Implement the International Energy Program.”
None

Ideally, the text of each entry after the citation information would be different.

What kind of loop do I need to run to print this correctly? Is there a more "pythonic" way of doing this kind of text extraction?
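For reference, one reading of the code above is that `print_all_paragraphs` calls `get_next_section(page)` on every pass and trims `page` by `end_paragraph`, an offset measured inside the section rather than inside `page`, so the same section can be found again. The sketch below, written for Python 3, shows a loop that carries the search position forward instead. `iter_sections` and `iter_paragraphs` are hypothetical names, the sample string is made up, and string scanning like this assumes bare `<P>` tags with no attributes.

```python
def iter_sections(page):
    """Yield the text inside each <SECTION>...</SECTION>, in document order."""
    pos = 0
    while True:
        start = page.find('<SECTION', pos)
        if start == -1:
            return
        tag_end = page.find('>', start)
        if tag_end == -1:
            return
        end = page.find('</SECTION>', tag_end + 1)
        if end == -1:
            return
        yield page[tag_end + 1:end]
        pos = end + len('</SECTION>')  # resume the search after this section


def iter_paragraphs(section):
    """Yield the text of every <P>...</P> inside one section (possibly none)."""
    pos = 0
    while True:
        start = section.find('<P>', pos)
        if start == -1:
            return
        end = section.find('</P>', start)
        if end == -1:
            return
        yield section[start + len('<P>'):end]
        pos = end + len('</P>')  # resume the search after this paragraph


# Made-up sample: the middle section is "Reserved" and has no <P> tags.
sample = (
    "<SECTION><SECTNO>### 9.10</SECTNO><P>First.</P><P>Second.</P></SECTION>"
    "<SECTION><SECTNO>### 9.11</SECTNO><SUBJECT>[Reserved]</SUBJECT></SECTION>"
    "<SECTION><SECTNO>### 9.12</SECTNO><P>Third.</P></SECTION>"
)
for section in iter_sections(sample):
    for paragraph in iter_paragraphs(section):
        print(paragraph)
```

A section without paragraphs simply yields nothing from the inner loop, so a reserved section is skipped rather than ending the run.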

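I realize I'm a complete novice. One of the main problems I'm facing is that I simply don't have the vocabulary or topic knowledge to find detailed answers about parsing XML at this level of detail. Recommended reading is also very welcome.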

score 0 · Accepted Answer

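I like to solve problems like this with XPath or XSLT. You can find a good implementation in lxml (it is not part of the standard distribution, so you need to install it). For example, the XPath //CHAPTER/SECTION[SECTNO] selects all sections that contain data. From there, use relative XPath expressions to get the values you need. The multiple nested for loops go away. XPath has a bit of a learning curve, but there are plenty of examples.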
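A minimal sketch of the idea, written for Python 3. To stay runnable without extra installs, it uses the standard library's xml.etree.ElementTree, which supports only a limited XPath subset, rather than lxml. Note that in the simplified sample from the question, SECTION is a child of CHAPTER and a sibling of HD, so the sketch selects sections with `.//SECTION[SECTNO]`. The sample data, the second section number, the `section_rows` helper, and the stripping of the literal `###` marker are all assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up sample modeled on the simplified XML in the question.
SAMPLE = """<CHAPTER>
<HD SOURCE="HED">PART 229</HD>
  <SECTION>
    <SECTNO>### 229.120</SECTNO>
    <SUBJECT>Transfers of property.</SUBJECT>
    <P>First paragraph.</P>
    <P>Second paragraph.</P>
  </SECTION>
  <SECTION>
    <SECTNO>### 229.125</SECTNO>
    <SUBJECT>[Reserved]</SUBJECT>
  </SECTION>
</CHAPTER>"""


def section_rows(root, title, volume):
    """Yield one (title, volume, section number, paragraph text) row per <P>."""
    # The "[SECTNO]" predicate keeps only SECTION elements with a SECTNO child.
    for section in root.findall(".//SECTION[SECTNO]"):
        number = section.findtext("SECTNO", default="").replace("###", "").strip()
        # Relative path: only <P> children of this section. Reserved
        # sections have none, so they produce no rows.
        for p in section.findall("P"):
            # itertext() also collects text inside nested inline elements.
            text = " ".join("".join(p.itertext()).split())
            yield title, volume, number, text


root = ET.fromstring(SAMPLE)
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["title", "volume", "section", "text"])
writer.writerows(section_rows(root, "22", "1"))
print(out.getvalue())
```

With lxml installed, the same structure should carry over by parsing with `lxml.etree` and calling `root.xpath(...)` with the full XPath expression. The csv module handles quoting, which matters here because regulation text is full of commas.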
