python - Character set problems when scraping a web page with Python and requests

Question

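I'm trying to download a Chinese page (gb2312, according to its meta tag). When I run the code below and open the resulting file as gb2312 in gEdit, I get gibberish such as ê×××(ò) where the Chinese characters should be.

The source of the page in question is here: https://gist.github.com/anonymous/27663069655db7fd7a19 - the actual site is only accessible from within the educational institution.

My code: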

r = requests.post("http://example.com", data=payload, cookies=cookies)
f = open('myfile.txt', 'w')
f.write(r.text.encode('gb2312',errors="ignore"))
f.close()

The page's response headers:

{'content-length': '6164', 'x-powered-by': 'ASP.NET', 'date': 'Mon, 11 Mar 2013 05:11:24 GMT', 'cache-control': ' private', 'content-type': 'text/html', 'server': 'Microsoft-IIS/6.0'}

If I try to decode instead of encode, Python raises the following error:

f.write(r.text.decode('gb2312',errors="ignore"))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2017-2018: ordinal not in range(128)

score 1 · Accepted Answer

djc@enrai http $ python
Python 2.7.3 (default, Jun 18 2012, 09:39:59)
[GCC 4.5.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib
>>> rsp = urllib.urlopen('https://gist.github.com/anonymous/27663069655db7fd7a19/raw/836a5c55d0f87a2fa5edcc9a14097c945452f520/chinese.html').read()
>>> import chardet
>>> chardet.detect(rsp)
{'confidence': 0.99, 'encoding': 'utf-8'}
>>> rsp.decode('utf-8')
u'\n<HTML><HEAD>(snip)</BODY></HTML>\n'

So don't trust the declared charset (here, the gb2312 meta tag): chardet reports that the bytes are actually UTF-8.
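As a rough sketch of acting on that advice without chardet, you can work from the raw bytes and test candidate encodings yourself. The helper name `decode_body` and the "UTF-8 first, declared charset second" order are my own assumptions, not something from the question or from requests. It is written for Python 3, unlike the Python 2 session above.

```python
def decode_body(raw, declared=None):
    """Return (text, encoding_used) for raw response bytes.

    Hypothetical helper: tries strict UTF-8 first, because bytes in a
    legacy encoding such as gb2312 rarely happen to be valid UTF-8,
    then falls back to the charset the page declares.
    """
    candidates = ["utf-8"]
    if declared:
        candidates.append(declared)
    for name in candidates:
        try:
            return raw.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # Wrong guess, or a codec name Python does not know: try the next one.
            continue
    # Last resort: decode anyway and mark undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8"


if __name__ == "__main__":
    sample = "中文".encode("utf-8")  # stand-in for the downloaded bytes
    text, used = decode_body(sample, declared="gb2312")
    # Write with an explicit encoding instead of relying on the default.
    with open("myfile.txt", "w", encoding="utf-8") as f:
        f.write(text)
    print(used)
```

With requests, the bytes to pass in would be `r.content` rather than `r.text`. When the Content-Type is `text/html` with no charset, as in the headers above, requests falls back to ISO-8859-1 for `r.text`, which would account for the gibberish. Setting `r.encoding` before reading `r.text` is another way to override that guess.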
