python - Trouble parsing an XML archive with ElementTree

Question

I'm new to both Python and programming, so you may have to bear with me. I have a large number of XML files (RSS archives) and want to extract the news article URLs from them. I'm using Python 2.7.3 on Windows, and here is a sample of the XML I'm looking at:

<feed xmlns:media="http://search.yahoo.com/mrss/" xmlns:gr="http://www.google.com/schemas/reader/atom/" xmlns:idx="urn:atom-extension:indexing" xmlns="http://www.w3.org/2005/Atom" idx:index="no" gr:dir="ltr">
<!-- 
Content-type: Preventing XSRF in IE.

 -->
<generator uri="http://www.google.com/reader">Google Reader</generator>
<id>
tag:google.com,2005:reader/feed/http://feeds.smh.com.au/rssheadlines/national.xml
</id>
<title>The Sydney Morning Herald National Headlines</title>
<subtitle type="html">
The top National headlines from The Sydney Morning Herald. For all the news, visit http://www.smh.com.au.
</subtitle>
<gr:continuation>CJPL-LnHybcC</gr:continuation>
<link rel="self" href="http://www.google.com/reader/atom/feed/http://feeds.smh.com.au/rssheadlines/national.xml?n=1000&amp;c=%5BC%5D"/>
<link rel="alternate" href="http://www.smh.com.au/national" type="text/html"/>
<updated>2013-06-16T07:55:56Z</updated>
<entry gr:is-read-state-locked="true" gr:crawl-timestamp-msec="1371369356359">
<id gr:original-id="http://news.smh.com.au/breaking-news-sport/daley-opts-for-dugan-for-origin-two-20130616-2oc5k.html">tag:google.com,2005:reader/item/dabe358abc6c18c5</id>
<category term="user/03956512242887934409/state/com.google/read" scheme="http://www.google.com/reader/" label="read"/>
<title type="html">Daley opts for Dugan for Origin two</title>
<published>2013-06-16T07:12:11Z</published>
<updated>2013-06-16T07:12:11Z</updated>
<link rel="alternate" href="http://rss.feedsportal.com/c/34697/f/644122/s/2d5973e2/l/0Lnews0Bsmh0N0Bau0Cbreaking0Enews0Esport0Cdaley0Eopts0Efor0Edugan0Efor0Eorigin0Etwo0E20A130A6160E2oc5k0Bhtml/story01.htm" type="text/html"/>

Specifically, I want to extract the "original-id" link:

<id gr:original-id="http://news.smh.com.au/breaking-news-sport/daley-opts-for-dugan-for-origin-two-20130616-2oc5k.html">tag:google.com,2005:reader/item/dabe358abc6c18c5</id>

I originally tried BeautifulSoup for this but ran into problems, and from the research I've done, ElementTree seems to be the right tool. My first attempt with ET was:

import xml.etree.ElementTree as ET
tree = ET.parse('thefile.xml')
root = tree.getroot()

#first_original_id = root[8][0]

parents_of_interest = root[8::]

for elem in parents_of_interest:
    print elem.items()[0][1]

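As far as I can tell, parents_of_interest does grab the data I want (as a list of dictionaries), but the for loop only prints a series of "true" values. After reading the documentation and SO, this looks like the wrong approach.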

I think this has the answer I'm looking for, and it is a good explanation, but I can't seem to apply it to my own situation. Based on that answer I tried:

print tree.find('//{http://www.w3.org/2005/Atom}entry}id').text

But I got an error:

__main__:1: FutureWarning: This search is broken in 1.3 and earlier, and will be fixed in a future version.  If you rely
 on the current behaviour, change it to './/{http://www.w3.org/2005/Atom}entry}id'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'text'

Any help with this would be appreciated. Apologies if the question is long-winded; I figured I'd detail everything, just in case.

score 0 · Accepted Answer

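Your XPath expression targets the first id rather than the ones you are looking for, and original-id is an attribute of the element, not its text, so you should write something like this: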

idelem = tree.find('./{http://www.w3.org/2005/Atom}entry/{http://www.w3.org/2005/Atom}id')
if idelem is not None:
    print idelem.get('{http://www.google.com/schemas/reader/atom/}original-id')

This finds only the first matching id. If you want all of them, use findall and iterate over the results.
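As a minimal sketch of the findall variant, the snippet below collects the original-id attribute from every entry. The helper name original_ids and the shortened SAMPLE feed (with example.com URLs) are made up for illustration; only the two namespace URIs come from the feed in the question. It uses print with a single argument so it runs on Python 2.7 as well as Python 3.

```python
import xml.etree.ElementTree as ET

# Namespace URIs taken from the <feed> element in the question.
ATOM = '{http://www.w3.org/2005/Atom}'
GR = '{http://www.google.com/schemas/reader/atom/}'


def original_ids(root):
    """Return the gr:original-id attribute of each entry's id element.

    Hypothetical helper for illustration; `root` is the <feed> element.
    """
    urls = []
    # './entry/id' only matches id elements nested in an entry,
    # so the feed-level <id> is skipped.
    for idelem in root.findall('./' + ATOM + 'entry/' + ATOM + 'id'):
        url = idelem.get(GR + 'original-id')
        if url is not None:
            urls.append(url)
    return urls


# Shortened stand-in for one archive file (assumed data, not the real feed).
SAMPLE = '''<feed xmlns:gr="http://www.google.com/schemas/reader/atom/" xmlns="http://www.w3.org/2005/Atom">
<id>tag:google.com,2005:reader/feed/example</id>
<entry>
<id gr:original-id="http://example.com/a.html">tag:google.com,2005:reader/item/1</id>
</entry>
<entry>
<id gr:original-id="http://example.com/b.html">tag:google.com,2005:reader/item/2</id>
</entry>
</feed>'''

for url in original_ids(ET.fromstring(SAMPLE)):
    print(url)
```

For a real archive, passing ET.parse('thefile.xml').getroot() to the helper should behave the same way, assuming the file has the structure shown in the question.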
