python - I want to download this URL, but I get an error... Unicode (Python)

Question

theurl = 'http://bit.ly/6IcCtf/'
urlReq = urllib2.Request(theurl)
urlReq.add_header('User-Agent',random.choice(agents))
urlResponse = urllib2.urlopen(urlReq)
htmlSource = urlResponse.read()
if unicode == 1:
    #print urlResponse.headers['content-type']
    #encoding=urlResponse.headers['content-type'].split('charset=')[-1]
    #htmlSource = unicode(htmlSource, encoding)
    htmlSource =  htmlSource.encode('utf8')
return htmlSource

Please look at the unicode part. I tried these two options, but neither works.

htmlSource =  htmlSource.encode('utf8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 370747: ordinal not in range(128)

And this is what I get when I try the longer way of encoding (the commented-out lines)...

_mysql_exceptions.Warning: Incorrect string value: '\xE7\xB9\x81\xE9\xAB\x94...' for column 'html' at row 1

score 5 · Accepted Answer

Your HTML data is a byte string from the internet that is already encoded in some encoding. Before you can encode it to utf-8, you must decode it first.

Python is implicitly trying to decode it with the default ascii codec, which is why you get a UnicodeDecodeError and not a UnicodeEncodeError.
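To make the implicit step visible, here is a minimal sketch of what Python 2 effectively does when .encode('utf8') is called on a byte string. The sample bytes are taken from the MySQL warning in the question, and the explicit .decode('ascii') is an assumption that stands in for the hidden conversion so the snippet also runs on Python 3.

```python
# Bytes as they arrive from the network (sample copied from the MySQL
# warning in the question); 0xe7 is outside the ASCII range.
raw = b'\xe7\xb9\x81\xe9\xab\x94'

try:
    # Python 2 performs this ascii decode implicitly before encoding.
    raw.decode('ascii').encode('utf8')
    error_name = None
except UnicodeDecodeError as exc:
    # The failure happens in the decode step, never in the encode step.
    error_name = type(exc).__name__
    print(exc)
```

The exception is raised before any encoding happens, which matches the traceback in the question.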

You can solve the problem by explicitly decoding your byte string (using the appropriate encoding) before re-encoding it to utf-8.

Example:

utf8encoded = htmlSource.decode('some_encoding').encode('utf-8')

Instead of 'some_encoding', use the encoding that the page was originally encoded in.

You need to know which encoding a byte string uses before you can decode it.
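One common place to look for the encoding is the charset parameter of the Content-Type header, which is what the commented-out lines in the question were reaching for. The following is only a sketch: the helper names charset_from_content_type and to_utf8 are hypothetical, and falling back to utf-8 when no charset is declared is an assumption that will not hold for every site (some pages declare the wrong charset or declare it only in a meta tag).

```python
from email.message import Message


def charset_from_content_type(content_type, default='utf-8'):
    """Return the charset named in a Content-Type value, or `default`.

    The default is an assumption for pages that declare no charset.
    """
    msg = Message()
    msg['Content-Type'] = content_type
    # get_content_charset() returns the lowercased charset or None.
    return msg.get_content_charset() or default


def to_utf8(raw_bytes, content_type):
    """Decode with the declared charset, then re-encode as UTF-8 bytes."""
    encoding = charset_from_content_type(content_type)
    return raw_bytes.decode(encoding).encode('utf-8')


# Usage: a Latin-1 body is converted to UTF-8 bytes.
utf8encoded = to_utf8(b'caf\xe9', 'text/html; charset=ISO-8859-1')
```

In the question's code, the header value would come from urlResponse.headers['content-type'] and the bytes from urlResponse.read().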

score 3 · Accepted Answer

Shouldn't you decode instead? htmlSource = htmlSource.decode('utf8')

decode means "decode htmlSource from utf8 encoding"

encode means "encode htmlSource to utf8 encoding"

Since you are extracting existing data (crawled from a website), you need to decode it. When you insert it into mysql, you may need to encode it as utf8, depending on your mysql db/table/field collation.
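The two directions can be checked on a small sample. This sketch assumes the page really is UTF-8, which the bytes quoted in the question's MySQL warning suggest, since they decode cleanly as UTF-8. It does not touch the database, so the connection and column character set still need to be verified separately.

```python
# UTF-8 bytes as fetched from a site (sample from the question's MySQL warning).
raw = b'\xe7\xb9\x81\xe9\xab\x94'

# decode: bytes -> text. Six bytes become two characters.
text = raw.decode('utf8')

# encode: text -> bytes, for example just before handing the value
# to a database layer that expects UTF-8 encoded bytes.
stored = text.encode('utf8')

# The round trip gives back exactly the bytes that were downloaded.
round_trip_ok = (stored == raw)
```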

score 1 · Accepted Answer

You probably want to decode UTF-8, not encode it:

htmlSource =  htmlSource.decode('utf8')
