python - Python の一般的なユニコード

Question

Python 2.7.2 でユニコードを理解するのに問題があるため、アイドル状態でいくつかのテストを試みました。「わからない」とマークされているものは 2 つあります。失敗する理由を教えてください。他の項目については、私のコメントが正しいか教えてください。

>>> s
'Don\x92t '  # s is a string
>>> u
u'Don\u2019t '  # u is a unicode object
>>> type(u)     # confirm u is unicode
<type 'unicode'>
>>> type(s)     # confirm s is string
<type 'str'>
>>> type(s) == 'str' # wrong way to test
False
>>> isinstance(s, str)  # right way to test
True
>>> print s
Don’t       # works because idle can handle strings
>>> print u
Don’t       # works because idle can handle unicode
>>> open('9', 'w').write(s.encode('utf8')) #encode takes unicode, but s is a string,
                                            # so this fails
Traceback (most recent call last):
  File "<pyshell#28>", line 1, in <module>
    open('9', 'w').write(s.encode('utf8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 3: ordinal not in range(128)
>>> open('9', 'w').write(s) # write can write strings
>>> open('9', 'w').write(u) # write can't write unicode

Traceback (most recent call last):
  File "<pyshell#30>", line 1, in <module>
    open('9', 'w').write(u)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 3: ordinal not in range(128)
>>> open('9', 'w').write(u.encode('utf8'))  # encode turns unicode to string, which write can handle
>>> open('9', 'w').write(s.decode('utf8'))  # decode turns string to unicode, which write can't handle

Traceback (most recent call last):
  File "<pyshell#32>", line 1, in <module>
    open('9', 'w').write(s.decode('utf8'))
  File "C:\program files\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 3: invalid start byte
>>> e = '{}, {}'.format(s, u) # fails becase ''.format is string, while u is unicode

Traceback (most recent call last):
  File "<pyshell#33>", line 1, in <module>
    e = '{}, {}'.format(s, u)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 3: ordinal not in range(128)
>>> e = '{}, {}'.format(s, u.encode('utf8')) # works because u.encode is a string
>>> e = u'{}, {}'.format(s, u) # not sure

Traceback (most recent call last):
  File "<pyshell#36>", line 1, in <module>
    e = u'{}, {}'.format(s, u)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 3: ordinal not in range(128)
>>> e = u'{}, {}'.format(s.decode('utf8'), u) # not sure

Traceback (most recent call last):
  File "<pyshell#55>", line 1, in <module>
    e = u'{}, {}'.format(s.decode('utf8'), u)
  File "C:\program files\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 3: invalid start byte

>>> e = '\n'.join([s, u]) # wants strings, but u is unicode

Traceback (most recent call last):
  File "<pyshell#37>", line 1, in <module>
    e = '\n'.join([s, u])
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 3: ordinal not in range(128)
>>> e = '\n'.join([s, u.encode('utf8')]) # u.encode is now a string

score 2 · Accepted Answer

最初sはエンコードされた文字列ではなく、utf-8おそらくcp1250エンコードされた文字列です。したがって、それを使用してデコードするとutf-8常に失敗します。

>>> e = u'{}, {}'.format(s, u) # not sure

最初の「不明」は、関数のすべての引数を文字列にエンコードしようとするためu'{}, {}'です。しかし、がエンコードされているものがわからないため、がとしてエンコードされていると想定し、 (基本的にはを実行して)としてデコードしようとしますが、はエンコードされた文字列であるため失敗します。unicodeformatunicodessasciiasciis.decode('ascii')scp1250

>>> e = u'{}, {}'.format(s.decode('utf8'), u) # not sure

としてデコードしようとしたため、2 つ目は失敗しますがutf-8、実際には、前述のように、utf-8.

score 1 · Accepted Answer

1

于 2013-08-14T14:43:56.033 に答える

python - Python の一般的なユニコード

2 に答える 2

Related

Reference