python - What is the difference between these ways of handling Unicode strings in Python?

Question

I tried print a_str.decode("utf-8"), print uni_str, print uni_str.decode("utf-8"), and print uni_str.encode("utf-8").

However, only the first one works.

 >>> print '\xe8\xb7\xb3'.decode("utf-8")
 跳
 >>> print u'\xe8\xb7\xb3\xe8'
 è·³è
 >>> print u'\xe8\xb7\xb3\xe8'.decode("utf-8")
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
     return codecs.utf_8_decode(input, errors, True)
 UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
 >>> print u'\xe8\xb7\xb3\xe8'.encode("utf-8")
 è·³è

I am really confused about how to display a Unicode string correctly. If I have a string like a=u'\xe8\xb7\xb3\xe8', how can I print a?

score 3 · Accepted Answer


answered at 2012-08-05T07:46:23.460

score 1 · Accepted Answer

If you have a string like that, it is broken. To turn it into a byte string with the same byte values, you need to encode it as Latin-1, and then decode the result as UTF-8.
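The round trip described above can be sketched as follows. This is a minimal illustration, not part of the original answer: the helper name repair_mojibake and its errors parameter are hypothetical, and the code is written so that it should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-

def repair_mojibake(text, errors="strict"):
    """Hypothetical helper: undo a wrong Latin-1 decode of UTF-8 data."""
    # Latin-1 maps code points U+0000..U+00FF to the bytes 0x00..0xFF,
    # so this recovers the original byte values unchanged.
    raw = text.encode("latin-1")
    # Now interpret those bytes as what they really were: UTF-8.
    return raw.decode("utf-8", errors)

# Three code points that are really the UTF-8 bytes of one character.
fixed = repair_mojibake(u"\xe8\xb7\xb3")

# The string from the question has a stray trailing \xe8, so a strict
# decode fails; errors="ignore" drops the incomplete sequence instead.
lenient = repair_mojibake(u"\xe8\xb7\xb3\xe8", errors="ignore")
```

With the default errors="strict", the four-character string from the question raises UnicodeDecodeError, which is usually the safer behaviour because it signals that the data is truncated.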

score 0 · Accepted Answer

The Unicode string u'\xe8\xb7\xb3\xe8' is equivalent to u'\u00e8\u00b7\u00b3\u00e8'. What you want is u'\u8df3', which can be encoded in UTF-8 as '\xe8\xb7\xb3'.
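A quick way to check both statements is the sketch below. The variable names are illustrative only, and the b'' prefix is used so the same lines should work on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-

broken = u"\xe8\xb7\xb3\xe8"

# In a unicode literal, \xNN is just shorthand for the code point \u00NN.
same_string = (broken == u"\u00e8\u00b7\u00b3\u00e8")

# The intended character is a single code point, U+8DF3.
wanted = u"\u8df3"

# Encoding that one code point as UTF-8 yields three bytes.
utf8_bytes = wanted.encode("utf-8")
matches_expected_bytes = (utf8_bytes == b"\xe8\xb7\xb3")
```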

In Python 2, unicode strings are stored as UCS-2 or UCS-4, depending on a build option. On a UCS-2 build, u'\xe8\xb7\xb3\xe8' is therefore a string of four 16-bit Unicode characters.
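To see that the string really holds four separate characters, one can inspect its length and code points, as in this small sketch. The sys.maxunicode check is an added illustration: it is 0xFFFF on a narrow (UCS-2) Python 2 build and 0x10FFFF on wide builds and on Python 3.3 or later.

```python
import sys

s = u"\xe8\xb7\xb3\xe8"

# Four characters, not three bytes plus one: each \xNN is its own code point.
length = len(s)

# Every code point is below 0x100, which is why Latin-1 can encode the string.
code_points = [ord(ch) for ch in s]

# True only on a narrow (UCS-2) build of Python 2.
is_narrow_build = (sys.maxunicode == 0xFFFF)
```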

If a UTF-8 string (an 8-bit string) has been mistakenly interpreted as Unicode (a 16-bit string), you first need to convert it back to an 8-bit string.

>>> ''.join([chr(ord(a)) for a in u'\xe8\xb7\xb3']).decode('utf8')
u'\u8df3'

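Note that '\xe8\xb7\xb3\xe8' is not a valid UTF-8 string: its last byte, '\xe8', is the lead byte of a multi-byte sequence (a three-byte one, since 0xE8 is 11101000 in binary), so it cannot end a UTF-8 string.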
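The following sketch illustrates why the trailing byte is a problem. The helper utf8_sequence_length is hypothetical and classifies a byte by its bit pattern only; it does not reject overlong or out-of-range lead bytes the way a real decoder does.

```python
def utf8_sequence_length(lead_byte):
    """Hypothetical helper: length announced by a UTF-8 lead byte, 0 if none."""
    if lead_byte < 0x80:
        return 1  # 0xxxxxxx: single-byte (ASCII) character
    if 0xC0 <= lead_byte <= 0xDF:
        return 2  # 110xxxxx: starts a two-byte sequence
    if 0xE0 <= lead_byte <= 0xEF:
        return 3  # 1110xxxx: starts a three-byte sequence
    if 0xF0 <= lead_byte <= 0xF7:
        return 4  # 11110xxx: starts a four-byte sequence
    return 0      # 10xxxxxx continuation byte, or not a lead byte at all

data = b"\xe8\xb7\xb3\xe8"

# The final 0xE8 promises two more continuation bytes that never arrive.
announced = utf8_sequence_length(0xE8)

# A strict decoder therefore rejects the whole byte string.
try:
    data.decode("utf-8")
    is_valid = True
except UnicodeDecodeError:
    is_valid = False
```

Dropping the last byte leaves b"\xe8\xb7\xb3", which is a complete three-byte sequence and decodes cleanly.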
