python - Gibberish from urlopen

Question

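I'm trying to read some utf-8 files from the addresses in the code below. It works for most of them, but for some files urllib2 (and urllib) can't read the content.

The obvious answer here is that the second file is corrupt, but the strange thing is that IE reads both of them without any trouble. The code has been tested on both XP and Linux with identical results. Any suggestions?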

import urllib2

# This works:
f = urllib2.urlopen("http://www.gutenberg.org/cache/epub/145/pg145.txt")
line = f.readline()
print "this works: %s" % line
line = unicode(line, 'utf-8')  # ... works fine

# This doesn't:
f = urllib2.urlopen("http://www.gutenberg.org/cache/epub/144/pg144.txt")
line = f.readline()
print "this doesn't: %s" % line
line = unicode(line, 'utf-8')  # ... causes an exception

score 2 · Accepted Answer

>>> f=urllib2.urlopen("http://www.gutenberg.org/cache/epub/144/pg144.txt")
>>> f.headers.dict
{'content-length': '304513', ..., 'content-location': 'pg144.txt.utf8.gzip', 'content-encoding': 'gzip', ..., 'content-type': 'text/plain; charset=utf-8'}

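The response is gzip-compressed (note the 'content-encoding': 'gzip' header). Either set a request header that stops the site from sending a gzip-encoded response, or decompress the response before decoding it.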
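The "decompress first" option could look roughly like the sketch below. The helper name `decode_body` and its parameters are made up for illustration and are not part of urllib2. It assumes the whole body has been read into memory and that the charset is utf-8, as the Content-Type header above reports.

```python
import gzip
import io


def decode_body(raw, content_encoding, charset="utf-8"):
    """Return the response body as text, gunzipping it first if needed.

    Hypothetical helper: `raw` is the byte string returned by f.read(),
    and `content_encoding` is the Content-Encoding header value (or None).
    """
    if content_encoding and content_encoding.lower() == "gzip":
        # GzipFile needs a file-like object, so wrap the bytes in BytesIO.
        raw = gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    return raw.decode(charset)


# Possible use with the question's code (Python 2, not run here):
#   f = urllib2.urlopen("http://www.gutenberg.org/cache/epub/144/pg144.txt")
#   text = decode_body(f.read(), f.headers.get("content-encoding"))
```

Checking the header instead of always gunzipping means the same helper can handle both URLs from the question, since only one of them comes back compressed.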

score -1 · Accepted Answer

This isn't a solution as such, but take a look at the requests library: http://pypi.python.org/pypi/requests
