python - UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026'

Question

I am learning about urllib2 and Beautiful Soup, and on my first tests I am getting errors like this:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 10: ordinal not in range(128)

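There seem to be lots of posts about this type of error, and I have tried the solutions I can understand, but there seems to be a catch-22 with each of them.

I want to print post.text (where text is a Beautiful Soup method that just returns the text). Both str(post.text) and post.text produce Unicode errors (on characters such as the right apostrophe ' and the ellipsis ...).

So I add post = unicode(post) above str(post.text), and then I get: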

AttributeError: 'unicode' object has no attribute 'text'

I also tried (post.text).encode() and (post.text).renderContents(). The latter produces the error:

AttributeError: 'unicode' object has no attribute 'renderContents'

And then I tried str(post.text).renderContents() and got the error:

AttributeError: 'str' object has no attribute 'renderContents'

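It would be great if I could just define, somewhere at the top of the document, 'make this content "interpretable"' and still have access to the text function I need.

Update: after the suggestions:

If I add post = post.decode("utf-8") above str(post.text), I get: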

TypeError: unsupported operand type(s) for -: 'str' and 'int'

If I add post = post.decode() above str(post.text), I get:

AttributeError: 'unicode' object has no attribute 'text'

If I add post = post.encode("utf-8") above (post.text), I get:

AttributeError: 'str' object has no attribute 'text'

I tried print post.text.encode('utf-8') and got:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 39: ordinal not in range(128)

And, for the sake of trying things that might work, I installed lxml for Windows and used it like this:

parsed_content = BeautifulSoup(original_content, "lxml")

This follows the documentation at http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters

These steps didn't seem to make a difference.

I'm using Python 2.7.4 and Beautiful Soup 4.

Solution:

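After getting a deeper understanding of Unicode, utf-8 and the Beautiful Soup types, it turned out to have something to do with my printing methodology. I removed all of my str calls and concatenations, e.g. str(something) + post.text + str(something_else) became something, post.text, something_else, and it now seems to print well, except that I have less control over the formatting at this stage (e.g. spaces are inserted at each comma).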
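
If the loss of formatting control is a concern, one possible alternative is to build the whole line as a single unicode string and encode it once at the end. This is only a sketch: score, post_text and rating are hypothetical stand-ins for something, post.text and something_else, and utf-8 is an assumed output encoding.

```python
# Hypothetical stand-ins for something, post.text and something_else.
score = 42
post_text = u"It\u2019s here\u2026"   # contains a right apostrophe and an ellipsis
rating = 3.5

# Format into a unicode template instead of calling str() on each piece,
# so nothing is implicitly pushed through the ASCII codec.
line = u"{0}: {1} [{2}]".format(score, post_text, rating)

# Encode once, explicitly, at the point of output (utf-8 is an assumption).
encoded_line = line.encode("utf-8")
```

The separators are now whatever the template says, so the extra spaces that the comma form of print inserts are avoided.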

score 46 · Accepted Answer

In Python 2, unicode objects can only be printed if they can be converted to ASCII. If one can't be encoded in ASCII, you'll get that error. You probably want to explicitly encode it and then print the resulting str:

print post.text.encode('utf-8')
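
To illustrate the difference between the two codecs, here is a minimal sketch that does not need Beautiful Soup. The variable text is a hypothetical stand-in for post.text, and the snippet is written so that it should behave the same on Python 2.7 and Python 3.

```python
text = u"Loading\u2026"  # hypothetical stand-in for post.text

# The ASCII codec has no representation for U+2026, so this raises.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# UTF-8 can represent every code point; the ellipsis becomes three bytes.
utf8_bytes = text.encode("utf-8")
ellipsis_bytes = list(bytearray(u"\u2026".encode("utf-8")))
```

The first of those three bytes is 0xe2, which is the byte named in the UnicodeDecodeError quoted in the question.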

score 2 · Answer

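    import urllib.request
    from bs4 import BeautifulSoup
    # THE_URL is a placeholder for the page to fetch
    html = urllib.request.urlopen(THE_URL).read()
    soup = BeautifulSoup(html)
    print("'" + str(soup.encode("ascii")) + "'")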

Worked for me ;-)
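
A likely reason this does not raise is that Beautiful Soup's encode is understood to fall back to XML character references for characters the target encoding cannot represent (that default is my assumption, not something stated above). The standard library error handlers show the same idea; text here is a hypothetical sample string.

```python
text = u"Wait\u2026"  # hypothetical sample containing an ellipsis

# Replace unencodable characters with XML numeric character references.
as_refs = text.encode("ascii", "xmlcharrefreplace")

# Replace unencodable characters with a question mark.
as_marks = text.encode("ascii", "replace")

# Silently drop unencodable characters.
dropped = text.encode("ascii", "ignore")
```

All three avoid the exception, but only the first keeps enough information to recover the original character.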

score 0 · Answer

Have you tried .decode() or .decode("utf-8")?

Also, I recommend using lxml with the html5lib parser:

http://lxml.de/html5parser.html
