python - Removing non-ASCII characters from specific string types in Python

Question

>>> teststring = 'aõ'
>>> type(teststring)
<type 'str'>
>>> teststring
'a\xf5'
>>> print teststring
aõ
>>> teststring.decode("ascii", "ignore")
u'a'
>>> teststring.decode("ascii", "ignore").encode("ascii")
'a'

This is what I wanted it to be internally, since I am removing the non-ASCII characters. But why did decode("ascii") give out a unicode string?

>>> teststringUni = u'aõ'
>>> type(teststringUni)
<type 'unicode'>
>>> print teststringUni
aõ
>>> teststringUni.decode("ascii" , "ignore")

Traceback (most recent call last):
  File "<pyshell#79>", line 1, in <module>
    teststringUni.decode("ascii" , "ignore")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf5' in position 1: ordinal not in range(128)
>>> teststringUni.decode("utf-8" , "ignore")

Traceback (most recent call last):
  File "<pyshell#81>", line 1, in <module>
    teststringUni.decode("utf-8" , "ignore")
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf5' in position 1: ordinal not in range(128)
>>> teststringUni.encode("ascii" , "ignore")
'a'

Which is, again, what I wanted. I don't understand this behavior. Can someone explain what is happening here?

Edit: I thought understanding this would let me solve my actual program problem, which I describe here: Converting Unicode objects containing non-ASCII symbols into string objects (in Python)

score 4 · Accepted Answer

Why did decode("ascii") give out a unicode string?

Because that's what decode is for: it decodes byte strings, like your ASCII one, into unicode.

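In your second example, you're trying to "decode" a string that is already unicode, which makes no sense as a conversion. Before it can decode anything, Python 2 first has to turn the unicode object into a byte string, and it does that implicitly with the default encoding (ASCII). You didn't perform that encode step explicitly, so the 'ignore' parameter doesn't apply to it, and it raises the error that it can't encode the non-ASCII character. That is why a call to decode ends in a UnicodeEncodeError.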
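The failing step is an ASCII encode that runs without the 'ignore' handler. Here is a minimal sketch of that difference, written in Python 3 syntax as an assumption (the question uses Python 2.7); in Python 3, `str` plays the role of Python 2's `unicode`. The variable names are illustrative only.

```python
# Python 3 syntax: str here corresponds to Python 2's unicode type.
text = "a\xf5"  # the two characters 'a' and 'õ'

# A strict ASCII encode (no error handler) fails on the non-ASCII character.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Passing "ignore" explicitly drops the characters that cannot be encoded.
ignored = text.encode("ascii", "ignore")
```

So the fix in the question, calling encode("ascii", "ignore") directly on the unicode object, works because the error handler is attached to the step that actually fails.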

The trick to all of this is remembering that decode takes an encoded byte string and converts it to Unicode, and encode does the reverse. It may be easier once you understand that Unicode is not an encoding.
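As a rough illustration of the two directions, here is a sketch in Python 3 syntax (an assumption; the question uses Python 2.7, where `bytes` corresponds to `str` and `str` to `unicode`). It also assumes the byte 0xF5 came from a Latin-1 style encoding, which the question does not state.

```python
# Encoded bytes: 'a' followed by byte 0xF5 (õ in Latin-1, an assumed encoding).
raw = b"a\xf5"

# decode: encoded bytes -> Unicode text
text = raw.decode("latin-1")

# encode: Unicode text -> encoded bytes (the reverse direction)
round_trip = text.encode("latin-1")

# Decoding as ASCII with "ignore" drops the byte that is not valid ASCII.
ascii_text = raw.decode("ascii", "ignore")
```

Note that the result of decode is always text, never bytes, which is why the first example in the question printed u'a'.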

score 4 · Accepted Answer

It's simple: .encode converts Unicode objects into strings, and .decode converts strings into Unicode.
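Putting that rule to work, one possible helper for the original goal might look like the sketch below. `strip_non_ascii` is a hypothetical name, not something from the question, and the code is written in Python 3 syntax as an assumption (`bytes` for Python 2's `str`, `str` for Python 2's `unicode`).

```python
def strip_non_ascii(value):
    """Hypothetical helper: return value as text with non-ASCII characters dropped."""
    if isinstance(value, bytes):
        # Byte string: decode is the right direction; ignore invalid bytes.
        return value.decode("ascii", "ignore")
    # Text: encode with "ignore" to drop non-ASCII, then decode back to text.
    return value.encode("ascii", "ignore").decode("ascii")


from_text = strip_non_ascii("a\xf5")
from_bytes = strip_non_ascii(b"a\xf5")
```

Checking the type first picks the correct direction for each input, so no implicit conversion with the default codec is ever needed.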
