python - Setting sticky cookies in Scrapy

Question

The website I am scraping has JavaScript that sets a cookie and then checks it on the backend to make sure JS is enabled. Extracting the cookie from the HTML code is simple, but setting it seems to be a problem in Scrapy. So my code is as follows:

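import re

from scrapy.http import Request
from scrapy.contrib.spiders import Rule
from scrapy.contrib.spiders.init import InitSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor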

class TestSpider(InitSpider):
    ...
    rules = (Rule(SgmlLinkExtractor(allow=('products/./index\.html', )), callback='parse_page'),)

    def init_request(self):
        return Request(url = self.init_url, callback=self.parse_js)

    def parse_js(self, response):
        match = re.search(r"setCookie\('(.+?)',\s*?'(.+?)',", response.body, re.M)
        if match:
            cookie = match.group(1)
            value = match.group(2)
        else:
            raise BaseException("Did not find the cookie", response.body)
        return Request(url=self.test_page, callback=self.check_test_page, cookies={cookie:value})

    def check_test_page(self, response):
        if 'Welcome' in response.body:
            self.initialized()

    def parse_page(self, response):
        scraping....
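
The cookie extraction step in parse_js can be tried on its own, without Scrapy. The following is a minimal sketch using only the standard library. `SAMPLE_HTML`, the cookie name `js_enabled`, the value `abc123`, and the helper `extract_cookie` are all hypothetical, since the real page markup is not shown here.

```python
import re

# Hypothetical page fragment; the real site's markup is an assumption.
SAMPLE_HTML = """
<script type="text/javascript">
    setCookie('js_enabled', 'abc123', 365);
</script>
"""

# Same pattern as in parse_js: captures the first two quoted arguments.
COOKIE_RE = re.compile(r"setCookie\('(.+?)',\s*?'(.+?)',", re.M)


def extract_cookie(html):
    """Return {name: value} for the first setCookie(...) call found in html."""
    match = COOKIE_RE.search(html)
    if match is None:
        raise ValueError("Did not find the cookie")
    return {match.group(1): match.group(2)}
```

The returned dict has the shape that the `cookies` argument of `Request` expects. Note that the pattern is a `str`, so on Python 3 it would need to run against `response.text` rather than `response.body`, which is bytes there.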

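I can confirm that the content is available in check_test_page; the cookie works perfectly. But it never even reaches parse_page, because the CrawlSpider does not see any links without the right cookie. Is there a way to set a cookie for the duration of the scraping session? Or do I have to use BaseSpider and add the cookie to every request manually?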

A less desirable alternative would be to set the cookie through the Scrapy configuration files somehow (the value never seems to change). Is that possible?

score 1 · Accepted Answer

I have never used InitSpider before.

Looking at the code of scrapy.contrib.spiders.init.InitSpider, it reads:

def initialized(self, response=None):
    """This method must be set as the callback of your last initialization
    request. See self.init_request() docstring for more info.
    """
    self._init_complete = True
    reqs = self._postinit_reqs[:]
    del self._postinit_reqs
    return reqs

def init_request(self):
    """This function should return one initialization request, with the
    self.initialized method as callback. When the self.initialized method
    is called this spider is considered initialized. If you need to perform
    several requests for initializing your spider, you can do so by using
    different callbacks. The only requirement is that the final callback
    (of the last initialization request) must be self.initialized. 

    The default implementation calls self.initialized immediately, and
    means that no initialization is needed. This method should be
    overridden only when you need to perform requests to initialize your
    spider
    """
    return self.initialized()

You wrote:

I can confirm that the content is available in check_test_page; the cookie works perfectly. But it never even reaches parse_page, because the CrawlSpider does not see any links without the right cookie.

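I think parse_page is not being called because you never made a request with self.initialized as its callback, and the result of your direct call to self.initialized() is discarded instead of being returned.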

I think this should work:

def check_test_page(self, response):
    if 'Welcome' in response.body:
        return self.initialized()
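
To see why the `return` matters, here is a toy model of the hold-back pattern in the source quoted above. It is only a sketch: `MiniInitSpider` and its two check methods are hypothetical stand-ins, not Scrapy classes, and the URL is made up. In Scrapy, the engine schedules whatever a callback returns, so requests that are never returned are never crawled.

```python
class MiniInitSpider:
    """Toy stand-in for InitSpider: holds requests back until initialized() runs."""

    def __init__(self, start_urls):
        self._postinit_reqs = list(start_urls)  # requests held back during init
        self._init_complete = False

    def initialized(self, response=None):
        # Mirrors the quoted source: mark init done, hand back the held requests.
        self._init_complete = True
        reqs = self._postinit_reqs[:]
        del self._postinit_reqs
        return reqs

    def check_without_return(self, body):
        # Like the question's check_test_page: the result is dropped.
        if 'Welcome' in body:
            self.initialized()

    def check_with_return(self, body):
        # Like the fix above: the held requests reach the caller.
        if 'Welcome' in body:
            return self.initialized()


lost = MiniInitSpider(['http://example.com/a']).check_without_return('Welcome')
kept = MiniInitSpider(['http://example.com/a']).check_with_return('Welcome')
```

Here `lost` is `None`, while `kept` is the list of held-back requests that a caller could go on to schedule.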

score 0 · Accepted Answer

It turned out that InitSpider is a BaseSpider. So it looks like 1) there is no way to use CrawlSpider in this situation, and 2) there is no way to set sticky cookies.
