python - Fetching a page with cookies using the Python requests library

Question

I am trying out the requests library ( http://docs.python-requests.org/en/latest/ ), but I ran into a problem when fetching a page with cookies using requests.

For example:

import requests

url2 = 'https://passport.baidu.com'
parsedCookies = {'PTOKEN': '412f...', 'BDUSS': 'hnN2...', ...}  # cookie values are replaced by ... for privacy
req = requests.get(url2, cookies=parsedCookies)
text = req.text.encode('utf-8', 'ignore')
f = open('before.html', 'wb')  # binary mode, because text is an encoded byte string
f.write(text)
f.close()
req.close()

When I fetch the page with the code above, only the login page is saved to before.html rather than the logged-in page, which shows that I did not actually log in successfully.

But when I fetch the page with urllib2, it works as expected.

import urllib2  # Python 2 only

parsedCookies = "PTOKEN=412f...;BDUSS=hnN2...;..."  # different format, same content as the cookies above
req = urllib2.Request(url2)
req.add_header('Cookie', parsedCookies)
ret = urllib2.urlopen(req)
f = open('before_urllib2.html', 'w')
f.write(ret.read())
f.close()
ret.close()

With this code, the logged-in page is saved to before_urllib2.html.
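The two snippets above pass the same cookies in two shapes: a dict for requests and a raw Cookie header string for urllib2. As a rough sketch of how the two relate, the following converts between them using only the standard library. It assumes Python 3 (the code above is Python 2), the helper names cookie_header_from_dict and cookie_dict_from_header are made up for illustration, and the cookie values are placeholders rather than real credentials.

```python
from http.cookies import SimpleCookie


def cookie_header_from_dict(cookies):
    """Join name/value pairs into a single Cookie header value."""
    return "; ".join("{}={}".format(name, value) for name, value in cookies.items())


def cookie_dict_from_header(header):
    """Parse a Cookie header value back into a plain dict."""
    jar = SimpleCookie()
    jar.load(header)
    return {name: morsel.value for name, morsel in jar.items()}


# Placeholder values for illustration only.
cookies = {"PTOKEN": "placeholder-token", "BDUSS": "placeholder-bduss"}
header = cookie_header_from_dict(cookies)
round_trip = cookie_dict_from_header(header)

print(header)
print(round_trip == cookies)
```

Values that contain characters such as semicolons or spaces would need extra care, so this is only meant for simple token-like values.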

--

Is there something wrong with my code? Any reply would be appreciated.

score 2 · Accepted Answer

You can get what you need by using a Session object.

url2 = 'http://passport.baidu.com'
session = requests.Session()  # create a Session object
cookie = requests.utils.cookiejar_from_dict(parsedCookies)  # parsedCookies is the dict from the question
session.cookies.update(cookie)  # set the cookies of the Session object

# headers is assumed to be a dict of request headers defined elsewhere
req = session.get(url2, headers=headers, allow_redirects=True)

If you use the requests.get function, the cookies are not sent for the redirected page. If you use Session().get instead, the cookies are kept and sent with every HTTP request. That is exactly what the concept of a "session" means.
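One way to see this without touching the network is to ask the session which Cookie header it would attach to a request. The sketch below uses Session.prepare_request for that. It assumes the requests package is installed, the helper cookie_header_for is a hypothetical name, and the cookie values are placeholders rather than real credentials.

```python
import requests
from requests.utils import cookiejar_from_dict

# Placeholder values for illustration only.
parsed_cookies = {"PTOKEN": "placeholder-token", "BDUSS": "placeholder-bduss"}

session = requests.Session()
session.cookies.update(cookiejar_from_dict(parsed_cookies))


def cookie_header_for(session, url):
    """Return the Cookie header the session would attach to a GET for url."""
    prepared = session.prepare_request(requests.Request("GET", url))
    return prepared.headers.get("Cookie")


# The session attaches the same cookies to the original URL and to the
# URL that the server redirects to.
first = cookie_header_for(session, "http://passport.baidu.com/center")
second = cookie_header_for(session, "http://passport.baidu.com/center?_t=1380462657")

print(first)
print(second)
```

Nothing is sent here; the requests are only prepared, so this shows what the session would send rather than how the server responds.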

Let me explain in detail what happens here.

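When I send the cookies to http://passport.baidu.com/center with the allow_redirects parameter set to False, the returned status code is 302, and one of the response headers is 'location': '/center?_t=1380462657' (this is a dynamic value generated by the server; replace it with the one you get from the server):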

url2 = 'http://passport.baidu.com/center'
req = requests.get(url2, cookies=parsedCookies, allow_redirects=False)
print(req.status_code)  # output 302
print(req.headers)

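But when the allow_redirects parameter is set to True, the request still does not end up on the page ( http://passport.baidu.com/center?_t=1380462657 ), and the server returns the login page instead. The reason is that requests.get does not send the cookies for the redirected page, here http://passport.baidu.com/center?_t=1380462657, so the login does not succeed. That is why we need the Session object.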

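If I set url2 = http://passport.baidu.com/center?_t=1380462657, the desired page is returned. So one solution is to use the code above to get the dynamic location value and build the path to your account, like http://passport.baidu.com/center?_t=1380462657, and then fetch the desired page.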

url2 = 'http://passport.baidu.com' + req.headers.get('location')
req = session.get(url2, cookies=parsedCookies, allow_redirects=True)
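The string concatenation above works because the location value here is a path starting with a slash. A slightly more general sketch, assuming Python 3 and using a hypothetical helper named resolve_redirect, resolves the header against the request URL with urljoin from the standard library, which also copes with an absolute Location value:

```python
from urllib.parse import urljoin


def resolve_redirect(request_url, location):
    """Resolve a Location header value against the URL that was requested."""
    return urljoin(request_url, location)


# Relative location, as in the 302 response described above.
relative = resolve_redirect("http://passport.baidu.com/center", "/center?_t=1380462657")

# Absolute location (a made-up example URL) is returned unchanged.
absolute = resolve_redirect("http://passport.baidu.com/center", "http://example.com/next")

print(relative)
print(absolute)
```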

But this is cumbersome, so when dealing with cookies, the Session object does an excellent job for us.
