python - How to fetch a web page containing Unicode characters in Python

Question

I am trying to fetch and parse a web page that contains non-ASCII characters (the URL is http://www.one.co.il). This is what I have:

import urllib2

url = "http://www.one.co.il"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
encoding = response.headers.getparam('charset') # windows-1255
html = response.read() # The length of this is valid - about 31000-32000,
                       # but printing the first characters shows garbage -
                       # '\x1f\x8b\x08\x00\x00\x00\x00\x00', instead of
                       # '<!DOCTYPE'
html_decoded = html.decode(encoding)

The last line raises an exception:

File "C:/Users/....\WebGetter.py", line 16, in get_page
  html_decoded = html.decode(encoding)
File "C:\Python27\lib\encodings\cp1255.py", line 15, in decode
  return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0xdb in position 14: character maps to <undefined>

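I have looked at other related questions, such as "urllib2 read to Unicode" and "How to handle response encoding from urllib.request.urlopen()", but could not find anything that helps with this.

Can someone shed some light on this and point me in the right direction? Thanks!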

score 1 · Accepted Answer

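The bytes 0x1f 0x8b 0x08 are the start of a gzip stream: 0x1f 0x8b is the gzip magic number, and 0x08 is the compression method (deflate). The server sent the page gzip-compressed, so you need to decompress the content before you decode or parse it.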
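
As a rough illustration of that advice, here is a minimal sketch of a helper that decompresses the body when it looks like gzip and only then decodes it. The function name decode_body and its parameters are my own, not part of urllib2. It uses only zlib, so it should work on both Python 2.7 and Python 3.

```python
import zlib

# First two bytes of every gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def decode_body(raw, charset, content_encoding=None):
    """Return text from a raw HTTP body, gunzipping it first if needed.

    raw              -- bytes as returned by response.read()
    charset          -- charset from the Content-Type header, e.g. 'windows-1255'
    content_encoding -- value of the Content-Encoding header, if known
    """
    # Treat the body as gzip if the header says so, or if it starts
    # with the gzip magic number (the symptom seen in the question).
    if content_encoding == "gzip" or raw[:2] == GZIP_MAGIC:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        raw = zlib.decompress(raw, 16 + zlib.MAX_WBITS)
    return raw.decode(charset)
```

With the code from the question, the last line would then become something like html_decoded = decode_body(html, encoding, response.headers.get('Content-Encoding')). If the Content-Type header carries no charset, encoding will be None and you will have to pick a fallback yourself.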
