python - urlopen, BeautifulSoup, and UTF-8 issue

Question

I'm trying to fetch a web page, but for some reason there is a foreign character embedded in the HTML file. The character is not visible when I use "View Source".

import urllib2
from BeautifulSoup import BeautifulSoup

isbn = 9780141187983
url = "http://search.barnesandnoble.com/booksearch/isbninquiry.asp?ean=%s" % isbn
opener = urllib2.build_opener()
url_opener = opener.open(url)
page = url_opener.read()  # raw bytes of the response body
html = BeautifulSoup(page)
html  # This line causes the error.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 21555: ordinal not in range(128)

I also tried...

html = BeautifulSoup(page.encode('utf-8'))

How can I load this web page into BeautifulSoup without getting this error?

score 12 · Accepted Answer

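The error is probably happening when you try to print the representation of the BeautifulSoup object, not when you parse the page. Printing happens automatically if, as I suspect, you are working in the interactive console.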

# This works fine. Note that we assign the BeautifulSoup object to a name,
# which keeps the interactive console from printing it immediately.
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(u'\xa0')

# This will probably show the error you saw
print soup

# And this would probably be fine
print soup.encode('utf-8')
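
The same failure can be reproduced without BeautifulSoup or the network, which may help confirm that the output encoding is the issue rather than the parsing. The following is only a minimal sketch: `try_encode` is a hypothetical helper written for this illustration, and it assumes the console falls back to the 'ascii' codec, as the traceback in the question suggests.

# Standard library only; behaves the same on Python 2 and Python 3.

text = u'\xa0'  # NO-BREAK SPACE, the character named in the traceback


def try_encode(value, codec):
    """Hypothetical helper: return the encoded bytes, or None on failure."""
    try:
        return value.encode(codec)
    except UnicodeEncodeError:
        # Same exception type as in the question's traceback.
        return None


# ASCII only covers code points 0-127, so U+00A0 cannot be encoded.
ascii_result = try_encode(text, 'ascii')

# UTF-8 can represent every Unicode character; U+00A0 becomes two bytes.
utf8_result = try_encode(text, 'utf-8')

Here `ascii_result` ends up as None and `utf8_result` holds the two bytes 0xC2 0xA0, which is why explicitly encoding to UTF-8 before printing would probably avoid the error.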
