python - Python decoding error with BeautifulSoup, requests, and lxml

Question

I'm trying to scrape some data from a popular browser-based game, but I'm running into decoding errors:

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.neopets.com/")
p = BeautifulSoup(r.text)

This produces the following stack trace:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "build/bdist.linux-x86_64/egg/bs4/__init__.py", line 172, in __init__
  File "build/bdist.linux-x86_64/egg/bs4/__init__.py", line 185, in _feed
  File "build/bdist.linux-x86_64/egg/bs4/builder/_lxml.py", line 195, in feed
  File "parser.pxi", line 1187, in lxml.etree._FeedParser.close (src/lxml/lxml.etree.c:87912)
  File "parsertarget.pxi", line 130, in lxml.etree._TargetParserContext._handleParseResult (src/lxml/lxml.etree.c:97055)
  File "lxml.etree.pyx", line 294, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:8862)
  File "saxparser.pxi", line 274, in lxml.etree._handleSaxCData (src/lxml/lxml.etree.c:93385)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb1 in position 476: invalid start byte

Running the following:

print repr(r.text[476 - 10: 476 + 10])

produces:

u'ttp-equiv="X-UA-Comp'

I'm really not sure what the problem is here. Any help is greatly appreciated. Thank you.

score 1 · Accepted Answer

.text on a response returns a decoded Unicode value, but you should let BeautifulSoup do the decoding for you:

p = BeautifulSoup(r.content, from_encoding=r.encoding)

r.content returns the undecoded raw bytestring, and r.encoding is the encoding detected from the headers.
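
To make the bytes-versus-text distinction concrete, here is a minimal sketch using only the standard library (Python 3 syntax). The `content` and `encoding` values are hypothetical stand-ins for r.content and r.encoding, not actual data from the site, and ISO-8859-1 is an assumed encoding chosen only because it contains the 0xb1 byte from the error message.

```python
# Hypothetical stand-ins for r.content and r.encoding (not real response data).
content = b"<p>5 \xb1 2</p>"   # raw, undecoded bytes, as r.content would hold
encoding = "ISO-8859-1"        # assumed header-declared encoding, as r.encoding would hold

# Decoding with the declared encoding succeeds:
# byte 0xb1 is the plus-minus sign (U+00B1) in ISO-8859-1.
text = content.decode(encoding)
print(repr(text))

# Treating the same bytes as UTF-8 fails, because 0xb1 is a
# continuation byte and cannot start a UTF-8 sequence.
try:
    content.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc
print(utf8_error)
```

Passing the raw bytes together with from_encoding, as in the line above, hands BeautifulSoup both pieces so that it performs this decoding step itself.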
