python - Scrapy fails to parse some HTML files correctly

Question

I have been using Scrapy for a few weeks, and I recently found that HtmlXPathSelector fails to parse some HTML files properly.

The web page http://detail.zol.com.cn/series/268/10227_1.html contains exactly one tag with these attributes:

`div id='param-more' class='mod_param  '`.

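When I used the XPath "//div[@id='param-more']" to select the tag, it returned [].

I tried the Scrapy shell and got the same result.

When I fetched the page with wget, I also found the tag "div id='param-more' class='mod_param'" in the HTML source, so I don't think the cause is that the tag only appears after some action is triggered.

Please give me some hints on how to solve this problem.

Below is the code snippet related to the problem. When processing the URL above, len(nodes_product) is always 0.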

def parse_series(self, response):
    hxs = HtmlXPathSelector(response)

    xpath_product = "//div[@id='param-normal']/table//td[@class='name']/a | "\
                    "//div[@id='param-more']/table//td[@class='name']/a"
    nodes_product = hxs.select(xpath_product)
    if len(nodes_product) == 0:
        # there's only the title, no other products in the series
        .......
    else:
        .......

score 3 · Accepted Answer

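This looks like a bug in the XPathSelectors. I created a quick test spider and ran into the same problem. I believe it has something to do with non-standard characters on the page.

I don't think the problem is that the 'param-more' div is tied to a JavaScript event or hidden with CSS. I disabled JavaScript and also changed my user agent (and location) to see whether that affected the data on the page. It didn't.

I was, however, able to parse the 'param-more' div using BeautifulSoup: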

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from bs4 import BeautifulSoup

class TestSpider(BaseSpider):
    name = "Test"

    start_urls = [
        "http://detail.zol.com.cn/series/268/10227_1.html"
                 ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        #data = hxs.select("//div[@id='param-more']").extract()

        data = response.body
        soup = BeautifulSoup(data)
        print soup.find(id='param-more')

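Someone else may know more about the XPathSelector issue, but for the time being you can save the HTML found by BeautifulSoup to an item and pass it into the pipeline.

Here is the link to the most recent BeautifulSoup version: http://www.crummy.com/software/BeautifulSoup/#Download

Update

I believe I found the specific issue. The web page in question specifies in a meta tag that it uses the GB 2312 charset. Converting from GB 2312 to Unicode is problematic because some characters have no Unicode equivalent. This would not be an issue except that UnicodeDammit, BeautifulSoup's encoding detection module, actually determines the encoding to be ISO 8859-2. The problem is that lxml determines a document's encoding by looking at the charset specified in the meta tag of the header. As a result, there is an encoding mismatch between what lxml and Scrapy perceive.

The following code demonstrates the problem described above and provides an alternative to relying on the BS4 library: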

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from bs4 import BeautifulSoup
import chardet

class TestSpider(BaseSpider):
    name = "Test"

    start_urls = [
        "http://detail.zol.com.cn/series/268/10227_1.html"
                 ]

    def parse(self, response):

        encoding = chardet.detect(response.body)['encoding']
        if encoding != 'utf-8':
            response.body = response.body.decode(encoding, 'replace').encode('utf-8')

        hxs = HtmlXPathSelector(response)
        data = hxs.select("//div[@id='param-more']").extract()
        #print encoding
        print data

Here you can see that forcing lxml to use UTF-8 encoding stops it from trying to map from what it perceives as GB 2312 to UTF-8.
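The decode-and-re-encode step can also be tried outside Scrapy. The sketch below is a minimal illustration, written for Python 3 with only the standard library. It swaps chardet for a fixed list of candidate codecs (the declared GB 2312, then its superset GB 18030), and the function name `to_utf8` and the sample bytes are made up for this example rather than taken from the page or from Scrapy.

```python
def to_utf8(body, declared="gb2312", superset="gb18030"):
    """Return (utf8_bytes, codec_used) for bytes in a GB-family charset.

    Hypothetical helper: tries strict decoding with the declared charset,
    then with a superset, and only then falls back to lossy decoding.
    """
    for codec in (declared, superset):
        try:
            return body.decode(codec).encode("utf-8"), codec
        except UnicodeDecodeError:
            continue  # strict decoding failed; try the next candidate
    # Last resort: keep the markup and replace undecodable bytes with U+FFFD.
    return body.decode(declared, "replace").encode("utf-8"), declared


# Sample bytes (an assumption): valid markup plus one byte that is
# invalid in both GB 2312 and GB 18030.
page = b"<div id='param-more' class='mod_param  '>\xff</div>"
utf8_body, codec = to_utf8(page)
print(codec, utf8_body)
```

The point of the fallback order is that a single stray byte should cost one replacement character, not the whole element. Whether this is the right policy for a given site is a judgment call.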

In Scrapy, the HtmlXPathSelector encoding is set in the scrapy/selector/lxmlsel.py module. This module passes the response body to the lxml parser using the response.encoding attribute, which is ultimately set in the scrapy/http/response/text.py module.

The code that handles setting the response.encoding attribute is as follows:

@property
def encoding(self):
    return self._get_encoding(infer=True)

def _get_encoding(self, infer=False):
    enc = self._declared_encoding()
    if enc and not encoding_exists(enc):
        enc = None
    if not enc and infer:
        enc = self._body_inferred_encoding()
    if not enc:
        enc = self._DEFAULT_ENCODING
    return resolve_encoding(enc)

def _declared_encoding(self):
    return self._encoding or self._headers_encoding() \
        or self._body_declared_encoding()

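The important thing to note here is that both _headers_encoding and _encoding ultimately reflect the encoding declared in the meta tag in the header, rather than using something like UnicodeDammit or chardet to determine the document's actual encoding. So situations will arise where a document contains characters that are invalid for the encoding it declares, and I believe Scrapy overlooks this, which ultimately leads to the problem we are seeing here.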
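To make the precedence concrete, here is a simplified, standalone sketch of a "declarations first" lookup. It is not Scrapy's code: `pick_encoding`, its parameters, and the regular expression are assumptions for illustration, and the body-inference step from the excerpt above is left out. It only shows that once any declaration is present, the actual bytes are never inspected.

```python
import codecs
import re

# Hypothetical, deliberately loose pattern for <meta ... charset=...>.
META_CHARSET = re.compile(rb"charset=[\"']?([A-Za-z0-9_-]+)", re.I)


def pick_encoding(explicit=None, header_charset=None, body=b"", default="ascii"):
    """Pick an encoding from declarations only, in decreasing priority."""
    meta = META_CHARSET.search(body[:4096])
    meta_charset = meta.group(1).decode("ascii") if meta else None
    for candidate in (explicit, header_charset, meta_charset):
        if not candidate:
            continue
        try:
            return codecs.lookup(candidate).name  # normalised codec name
        except LookupError:
            continue  # unknown label: skip it
    return default


# The trailing byte is invalid for GB 2312, yet the declaration still wins.
body = b'<meta http-equiv="Content-Type" content="text/html; charset=gb2312">\xff'
print(pick_encoding(body=body))
```

A detector such as chardet would instead look at the byte content itself, which is why the two approaches can disagree on a page like this one.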

score 0 · Accepted Answer

'mod_param ' != 'mod_param'

The class is not equal to 'mod_param', but it does contain 'mod_param'. Note the trailing whitespace.

stav@maia:~$ scrapy shell http://detail.zol.com.cn/series/268/10227_1.html
2012-08-23 09:17:28-0500 [scrapy] INFO: Scrapy 0.15.1 started (bot: scrapybot)
Python 2.7.3 (default, Aug  1 2012, 05:14:39)
IPython 0.12.1 -- An enhanced Interactive Python.

In [1]: hxs.select("//div[@class='mod_param']")
Out[1]: []

In [2]: hxs.select("//div[contains(@class,'mod_param')]")
Out[2]: [<HtmlXPathSelector xpath="//div[contains(@class,'mod_param')]" data=u'<div id="param-more" class="mod_param  "'>]

In [3]: len(hxs.select("//div[contains(@class,'mod_param')]").extract())
Out[3]: 1

In [4]: len(hxs.select("//div[contains(@class,'mod_param')]").extract()[0])
Out[4]: 5372
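The same effect can be reproduced without Scrapy. The sketch below uses Python 3's standard-library ElementTree on a hand-written fragment (an assumption, not the real page). ElementTree's XPath subset has no contains(), so the class check is done in Python by splitting the attribute into tokens; the helper name `has_class` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the page; note the two trailing spaces in class.
doc = ET.fromstring(
    "<html><body>"
    "<div id='param-more' class='mod_param  '><table/></div>"
    "</body></html>"
)

# Exact string comparison: 'mod_param  ' is not equal to 'mod_param'.
exact = doc.findall(".//div[@class='mod_param']")


def has_class(element, name):
    """True if `name` is one of the whitespace-separated class tokens."""
    return name in element.get("class", "").split()


by_token = [div for div in doc.iter("div") if has_class(div, "mod_param")]

print(len(exact), [div.get("id") for div in by_token])
```

With a full XPath 1.0 engine, a commonly used token-safe equivalent is `//div[contains(concat(' ', normalize-space(@class), ' '), ' mod_param ')]`, which avoids also matching classes such as 'mod_param_extra'.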
