python - Reading cookies from a Splash request

Question

I am trying to access cookies after making a request through Splash. Here is how I build the request:

script = """
function main(splash)
  splash:init_cookies(splash.args.cookies)
  assert(splash:go{
    splash.args.url,
    headers=splash.args.headers,
    http_method=splash.args.http_method,
    body=splash.args.body,
    })
  assert(splash:wait(0.5))

  local entries = splash:history()
  local last_response = entries[#entries].response
  return {
    url = splash:url(),
    headers = last_response.headers,
    http_status = last_response.status,
    cookies = splash:get_cookies(),
    html = splash:html(),
  }
end
"""
req = SplashRequest(
    url,
    self.parse_page,
    args={
        'wait': 0.5,
        'lua_source': script,
        'endpoint': 'execute'
    }
)

The script is an exact copy from the Splash documentation.

What I want is to access the cookies that the web page sets. Without Splash, the code below works as expected, but with Splash it does not.

self.logger.debug('Cookies: %s', response.headers.get('Set-Cookie'))

This is what is logged while using Splash:

2017-01-03 12:12:37 [spider] DEBUG: Cookies: None

When I am not using Splash, this code works and returns the cookies provided by the web page.

The Splash documentation shows the following code as an example:

def parse_result(self, response):
    # here response.body contains result HTML;
    # response.headers are filled with headers from last
    # web page loaded to Splash;
    # cookies from all responses and from JavaScript are collected
    # and put into Set-Cookie response header, so that Scrapy
    # can remember them.

I am not sure I understand this correctly, but I would expect to be able to access the cookies the same way as without Splash.

Middleware settings:

# Download middlewares 
DOWNLOADER_MIDDLEWARES = {
    # Use a random user agent on each request
    'crawling.middlewares.RandomUserAgentDownloaderMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    # Enable crawlera proxy
    'scrapy_crawlera.CrawleraMiddleware': 600,
    # Enable Splash to render javascript
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810, 
}

My question is: how do I access cookies while using a Splash request?

settings.py

spider.py

score 0 · Accepted Answer

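You are trying to read the data from the "static" headers sent by the server, but JS code in the page can also generate cookies. That is why Splash provides splash:get_cookies(). To access the values under "cookies" in the response, you should use the table returned by the Lua script: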

return {
   url = splash:url(),
   headers = last_response.headers,
   http_status = last_response.status,
   cookies = splash:get_cookies(),
   html = splash:html(),
}
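
Note that this table only comes back if the request actually runs the Lua script through Splash's execute endpoint. In scrapy-splash, endpoint is a keyword argument of SplashRequest itself (it defaults to render.html) rather than a key inside args, so it may be worth checking where it is set in the request from the question. Below is a minimal sketch of that arrangement. The helper name build_splash_request_kwargs is hypothetical, and the SplashRequest call is left as a comment so the snippet runs without scrapy-splash installed. It is not a claim that this is the cause of the missing cookies.

```python
def build_splash_request_kwargs(lua_script, wait=0.5):
    """Build keyword arguments for SplashRequest (hypothetical helper).

    'endpoint' is placed at the top level, next to 'args', instead of
    inside the 'args' dict as in the question.
    """
    return {
        "endpoint": "execute",
        "args": {
            "wait": wait,
            "lua_source": lua_script,
        },
    }


# Placeholder Lua script, only here to make the sketch self-contained.
kwargs = build_splash_request_kwargs("function main(splash) return {} end")

# In a spider (requires scrapy-splash, which is not imported here):
#     yield SplashRequest(url, self.parse_page, **kwargs)
print(sorted(kwargs))
```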

Try changing this line

self.logger.debug('Cookies: %s', response.headers.get('Set-Cookie'))

to

self.logger.debug('Cookies: %s', response.cookies)
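
If response.cookies is not available on the response object you get back, the same values can be read from the returned table itself. The sketch below assumes that scrapy-splash exposes the decoded table as response.data for the execute endpoint, and that each entry produced by splash:get_cookies() carries "name" and "value" keys. The helper name and the sample data are made up for illustration.

```python
def cookies_from_splash_data(data):
    """Map cookie names to values from the table returned by the Lua script.

    `data` is assumed to be the decoded JSON table (a dict) whose "cookies"
    entry holds the list produced by splash:get_cookies().
    """
    cookies = data.get("cookies") or []
    return {c["name"]: c.get("value", "") for c in cookies if "name" in c}


# Hypothetical sample shaped like the Lua script's return table.
sample = {
    "url": "http://example.com/",
    "http_status": 200,
    "cookies": [
        {"name": "sessionid", "value": "abc123", "domain": "example.com", "path": "/"},
        {"name": "theme", "value": "dark", "domain": "example.com", "path": "/"},
    ],
}

print(cookies_from_splash_data(sample))
```

In the spider callback this could be used as self.logger.debug('Cookies: %s', cookies_from_splash_data(response.data)), again assuming the response comes from the execute endpoint.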
