python - urllib2 does not return HTML

Question

Trying to spider/crawl through a third-party website, but I seem to have hit a snag:

Calling urlopen on a site gets a response, but when I read and print the HTML, it looks like I'm getting nothing back. Could this be due to some kind of blocking on the other end, or something else?

Currently I'm trying to open New York Times articles. The main pages return HTML; the articles, uh, don't.

import urllib

try:
    source = urllib.urlopen(target_site)  # target_site: URL of the article
    html = source.read()
    print "HTML: ", html.lower()
except IOError as e:
    print "Request failed: ", e

output:

HTML:
(other stuff)

Oh, and it also times out once in a while, but that's a different story, I'm hoping.

score 3 · Accepted Answer

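This isn't a problem with the New York Times articles themselves. The site may be refusing the page because your request doesn't carry an appropriate User-Agent header. This post explains how to set one.

If that's the case, try this: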

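import urllib2

try:
    req = urllib2.Request(target_site)
    req.add_header("User-Agent", "Mozilla/5.0")
    source = urllib2.urlopen(req)  # urllib2.urlopen accepts a Request object
    html = source.read()
    print "HTML: ", html.lower()
except urllib2.URLError as e:
    print "Request failed: ", e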
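For readers on Python 3, where urllib2 was merged into urllib.request, the same User-Agent idea might look roughly like the sketch below. The helper name build_request and the example URL are assumptions for illustration, not part of the original answer.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Usage (performs a real network request, so it is left commented out):
# req = build_request("https://example.com/")
# with urllib.request.urlopen(req, timeout=10) as resp:
#     html = resp.read().decode("utf-8", errors="replace")
```

Passing the header through the Request object keeps the change local to one request rather than altering a global opener.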

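Scratch that. That isn't the problem with the New York Times articles. The real cause is that nytimes.com tries to set cookies, but your client can't store them, which causes a redirect loop. You need to create a custom URL opener that can handle cookies. You can do that like this: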

import urllib2

# make a URL opener that can handle cookies
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
# read in the site
response = opener.open(target_site)
html = response.read()
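On Python 3 a cookie-aware opener can be sketched with urllib.request and http.cookiejar. The helper name make_cookie_opener is hypothetical, and returning the CookieJar alongside the opener is only a convenience for inspecting which cookies the site set.

```python
import http.cookiejar
import urllib.request


def make_cookie_opener():
    """Build an opener that stores cookies and resends them on later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


# Usage (needs network access, so it is left commented out):
# opener, jar = make_cookie_opener()
# html = opener.open(target_site, timeout=10).read()
# print([cookie.name for cookie in jar])
```

Because the jar persists across the redirects within one open() call, the server sees its cookie come back on the second hop instead of redirecting again.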

To check that you got the right article, you can write the HTML out to a file and open it in a web browser.
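Writing the page out and opening it in a browser could look something like this in Python 3, using only the standard library. The function names save_html and view_html and the temporary-file approach are assumptions; any writable path would do.

```python
import os
import pathlib
import tempfile
import webbrowser


def save_html(html_bytes, suffix=".html"):
    """Write raw HTML bytes to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as f:
        f.write(html_bytes)
    return path


def view_html(html_bytes):
    """Save the HTML and ask the default browser to open it."""
    path = save_html(html_bytes)
    webbrowser.open(pathlib.Path(path).as_uri())
    return path


# Usage (opens a browser window, so it is left commented out):
# view_html(html)
```

The bytes are written unchanged, so the browser sees exactly what the server sent, including any encoding declaration in the page.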

score 0 · Accepted Answer

I thought I'd add a plug for requests, which makes this relatively easy. After easy_install requests or pip install requests:

import requests

page = requests.get(page_url)
html = page.content

Edit: I looked at the URL posted in the comments on the question and thought I'd confirm that requests.get works for that page.
