python - Why is BeautifulSoup unable to correctly read/parse this RSS (XML) document?

Question

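YCombinator is nice enough to provide an RSS feed, and a big RSS feed, containing the top items on HackerNews. I am trying to write a Python script that fetches the RSS feed document and then parses out certain pieces of information using BeautifulSoup. However, I am getting some strange behavior when BeautifulSoup tries to get the content of each item.

Here are a few sample lines of the RSS feed: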

<rss version="2.0">
<channel>
<title>Hacker News</title><link>http://news.ycombinator.com/</link><description>Links for the intellectually curious, ranked by readers.</description>
<item>
    <title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and &#39;Notch&#39;</title>
    <link>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-dollar-boost-mark-cuban-and-notch</link>
    <comments>http://news.ycombinator.com/item?id=4944322</comments>
    <description><![CDATA[<a href="http://news.ycombinator.com/item?id=4944322">Comments</a>]]></description>
</item>
<item>
    <title>Two Billion Pixel Photo of Mount Everest (can you find the climbers?)</title>
    <link>https://s3.amazonaws.com/Gigapans/EBC_Pumori_050112_8bit_FLAT/EBC_Pumori_050112_8bit_FLAT.html</link>
    <comments>http://news.ycombinator.com/item?id=4943361</comments>
    <description><![CDATA[<a href="http://news.ycombinator.com/item?id=4943361">Comments</a>]]></description>
</item>
...
</channel>
</rss>

Here is the code I have written (in Python) to access this feed and print out the title, link, and comments for each item:

import sys
import requests
from bs4 import BeautifulSoup

request = requests.get('http://news.ycombinator.com/rss')
soup = BeautifulSoup(request.text)
items = soup.find_all('item')
for item in items:
    title = item.find('title').text
    link = item.find('link').text
    comments = item.find('comments').text
    print title + ' - ' + link + ' - ' + comments

However, this script produces output that looks like this:

EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and &#39;Notch&#39; -  - http://news.ycombinator.com/item?id=4944322
Two Billion Pixel Photo of Mount Everest (can you find the climbers?) -  - http://news.ycombinator.com/item?id=4943361
...

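As you can see, the middle item, link, is somehow being omitted. That is, the resulting value of link is an empty string. Why is that?

Digging into what is in soup, I realized that it is choking on something when it parses the XML. This can be seen by looking at the first item in items: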

>>> print items[0]
<item><title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and &#39;Notch&#39;</title></link>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-dollar-boost-mark-cuban-and-notch<comments>http://news.ycombinator.com/item?id=4944322</comments><description>...</description></item>

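You'll notice that something screwy is happening with just the link tag. It gets only the close tag, followed by the text that belongs to that tag. This is very strange behavior, especially in contrast to title and comments, which are parsed without a problem.

This seems to be a problem with BeautifulSoup, because what is actually read in by requests has nothing wrong with it. That said, I also tried the xml.etree.ElementTree API and hit the same problem (is BeautifulSoup built on this API?).

Does anyone know why this would be happening, or how I can still use BeautifulSoup without getting this error?

Note: I was finally able to get what I wanted with xml.dom.minidom, but this doesn't seem like a highly recommended library. I would like to continue using BeautifulSoup if possible.

Update: I am on a Mac with OSX 10.8, using Python 2.7.2 and BS4 4.1.3.

Update 2: I have lxml, and it was installed with pip. It is version 3.0.2. As for libxml, I checked in /usr/lib and the one that shows up is libxml2.2.dylib. I'm not sure when or how that was installed.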

score 7 · Accepted Answer

Wow, great question. This strikes me as a bug in BeautifulSoup. The reason you can't access the link using soup.find_all('item').link is that when you first load the markup into BeautifulSoup, it does something odd to it:

>>> from bs4 import BeautifulSoup as BS
>>> BS(html)
<html><body><rss version="2.0">
<channel>
<title>Hacker News</title><link/>http://news.ycombinator.com/<description>Links
for the intellectually curious, ranked by readers.</description>
<item>
<title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and 'No
tch'</title>
<link/>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-d
ollar-boost-mark-cuban-and-notch
    <comments>http://news.ycombinator.com/item?id=4944322</comments>
<description>Comments]]&gt;</description>
</item>
<item>
<title>Two Billion Pixel Photo of Mount Everest (can you find the climbers?)</ti
tle>
<link/>https://s3.amazonaws.com/Gigapans/EBC_Pumori_050112_8bit_FLAT/EBC_Pumori_
050112_8bit_FLAT.html
    <comments>http://news.ycombinator.com/item?id=4943361</comments>
<description>Comments]]&gt;</description>
</item>
...
</channel>
</rss></body></html>

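Look carefully: it has actually changed the first <link> tag to <link/> and then removed the </link> tag. I'm not sure why it does this, but unless the problem in the BeautifulSoup.BeautifulSoup class initialization is fixed, you're not going to be able to use it for now.

Update:

I think your best bet for now (albeit hack-y) is to use the following for link: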

>>> soup.find('item').link.next_sibling
u'http://news.ycombinator.com/'
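
The behavior and the workaround can be reproduced offline. Here is a minimal sketch, assuming bs4 is installed and forcing the built-in "html.parser" (the parser picked by default on the machines discussed here may have been a different one). The one-item feed and its example.com URLs are made up for illustration. In HTML, <link> is a void element that cannot hold content, which is why an HTML parser closes it at once and leaves the URL as a sibling text node.

```python
from bs4 import BeautifulSoup

# Hypothetical one-item feed for illustration, not real Hacker News data.
FEED = """<rss version="2.0"><channel>
<item>
    <title>Example story</title>
    <link>https://example.com/story</link>
    <comments>https://example.com/item?id=1</comments>
</item>
</channel></rss>"""

# Force an HTML parser. HTML treats <link> as a void element, so the tag is
# closed immediately and the URL becomes the text node that follows it.
soup = BeautifulSoup(FEED, "html.parser")
item = soup.find("item")

# Empty, because the <link> tag has no children under an HTML parser.
link_text = item.find("link").text

# The workaround: read the text node right after the tag and trim whitespace.
link_workaround = item.find("link").next_sibling.strip()

print(repr(link_text), link_workaround)
```

The strip() call matters because the sibling text node may also carry the newline and indentation that follow the URL.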

score 3 · Accepted Answer

Actually, the problem seems to be related to the parser you are using. By default, an HTML parser is used. Try soup = BeautifulSoup(request.text, 'xml') after installing the lxml module.

It will then use an XML parser instead of an HTML one, and everything should be fine.

See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for more info.
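
As a rough sketch of that suggestion, the snippet below parses a small feed with the 'xml' parser. It assumes both bs4 and lxml are installed. The helper name parse_items, the sample feed, and its example.com URLs are all made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical one-item feed for illustration, not real Hacker News data.
FEED = """<rss version="2.0"><channel>
<item>
    <title>Example story</title>
    <link>https://example.com/story</link>
    <comments>https://example.com/item?id=1</comments>
</item>
</channel></rss>"""


def parse_items(feed_text):
    """Return (title, link, comments) tuples, parsing the feed as XML."""
    # "xml" selects the lxml-backed XML parser, which has no notion of
    # HTML void elements, so <link> keeps its text content.
    soup = BeautifulSoup(feed_text, "xml")
    return [
        (
            item.find("title").text,
            item.find("link").text,
            item.find("comments").text,
        )
        for item in soup.find_all("item")
    ]


for title, link, comments in parse_items(FEED):
    print(title + " - " + link + " - " + comments)
```

Naming the parser explicitly also keeps the result from depending on which parser libraries happen to be installed on a given machine.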

score 1 · Accepted Answer

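I don't think there's a bug in BeautifulSoup here.

I installed a clean copy of BS4 4.1.3 on Apple's stock 2.7.2 from OS X 10.8.2, and everything worked as expected. It doesn't mis-parse the <link> as </link>, and therefore it doesn't have the problem with item.find('link').

I also tried using the stock xml.etree.ElementTree and xml.etree.cElementTree in 2.7.2, and xml.etree.ElementTree in python.org 3.3.0, to parse the same thing, and again it worked fine. Here's the code: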

import xml.etree.ElementTree as ET

# x holds the RSS feed text shown in the question
rss = ET.fromstring(x)
for channel in rss.findall('channel'):
  for item in channel.findall('item'):
    title = item.find('title').text
    link = item.find('link').text
    comments = item.find('comments').text
    print(title)
    print(link)
    print(comments)
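
The snippet above relies on x already holding the feed text. A self-contained variant might look like the sketch below, which uses only the standard library. The helper name read_items, the sample feed, and its example.com URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed for illustration, not real Hacker News data.
FEED = """<rss version="2.0"><channel>
<title>Example feed</title><link>https://example.com/</link>
<item>
    <title>Example story</title>
    <link>https://example.com/story</link>
    <comments>https://example.com/item?id=1</comments>
    <description><![CDATA[<a href="https://example.com/item?id=1">Comments</a>]]></description>
</item>
</channel></rss>"""

FIELDS = ("title", "link", "comments", "description")


def read_items(feed_text):
    """Return one dict per <item>, keyed by child tag name."""
    rss = ET.fromstring(feed_text)
    rows = []
    # "channel/item" matches every <item> directly under <channel>.
    for item in rss.findall("channel/item"):
        rows.append({tag: item.findtext(tag) for tag in FIELDS})
    return rows


for row in read_items(FEED):
    print(row["title"])
    print(row["link"])
    print(row["comments"])
```

Because ElementTree is an XML parser, the CDATA section in <description> comes back as plain text containing the anchor markup.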

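Then I installed lxml 3.0.2 (I believe BS uses lxml if it is available), using Apple's built-in /usr/lib/libxml2.2.dylib (which, according to xml2-config --version, is 2.7.8), and ran the same tests using its etree and using BS. Again, everything worked.

In addition to screwing up the <link>, jdotjdot's output shows that BS4 is screwing up the <description> in an odd way. The original is this: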

<description><![CDATA[<a href="http://news.ycombinator.com/item?id=4944322">Comments</a>]]></description>

His output is:

<description>Comments]]&gt;</description>

My output from running his exact same code is:

<description><![CDATA[<a href="http://news.ycombinator.com/item?id=4944322">Comments</a>]]></description>

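So it seems there's a much bigger problem going on here. The odd thing is that it's happening to two different people, yet it doesn't happen on a clean install of the latest version of anything.

That implies either that it's a bug that has since been fixed and I just have a newer version of whatever had the bug, or that there's something weird about the way they both installed something.

BS4 itself can be ruled out, since at least Treebranch has 4.1.3 just like me. Although, without knowing how he installed it, it could still be a problem with the installation.

Python and its built-in etree can be ruled out, since at least Treebranch has the same stock Apple 2.7.2 from OS X 10.8 as I do.

It could very well be a bug in lxml or the underlying libxml, or in the way they were installed. I know jdotjdot has lxml 2.3.6, so this could be a bug that was fixed somewhere between 2.3.6 and 3.0.2. In fact, according to the lxml website and the change notes for every version after 2.3.5, there is no 2.3.6, so whatever he has may be some kind of buggy release from very early on a canceled branch or something … I don't know his libxml version, how either was installed, or what platform he's on, so it's hard to guess, but at least this is something that can be investigated.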
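
To compare setups across machines, a small script along these lines can report the versions in play. The helper name parser_versions is made up for illustration, and libraries that are not installed are reported as None rather than raising.

```python
import sys


def parser_versions():
    """Collect version strings for the libraries involved in parsing."""
    versions = {
        "python": sys.version.split()[0],
        "bs4": None,
        "lxml": None,
        "libxml2": None,
    }
    try:
        import bs4
        versions["bs4"] = bs4.__version__
    except ImportError:
        pass  # bs4 is not installed
    try:
        from lxml import etree
        # Both constants are tuples of integers, e.g. (3, 0, 2, 0).
        versions["lxml"] = ".".join(map(str, etree.LXML_VERSION))
        versions["libxml2"] = ".".join(map(str, etree.LIBXML_VERSION))
    except ImportError:
        pass  # lxml is not installed
    return versions


for name, version in parser_versions().items():
    print(name, version)
```

LIBXML_VERSION reports the libxml2 that lxml is using at runtime, which may differ from the one it was compiled against.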
