python - Python 3: unescaping non-ASCII characters

Question

(Python 3.3.2) I need to unescape the non-ASCII escaped characters returned by a call to re.escape(). I have seen methods here and here, but they don't work for me. I'm working in a 100% UTF-8 environment.

import codecs

# pure ASCII string : ok
mystring = "a\n" # expected unescaped string : "a\n"
cod = codecs.getencoder('unicode_escape')
print(cod(mystring))

# non ASCII string : method #1
mystring = "€\\n"  # 3 characters ("€", "\", "n"), i.e. 5 bytes in UTF-8
# equivalent to : mystring = codecs.unicode_escape_decode(mystring)
cod = codecs.getdecoder('unicode_escape')
print(cod(mystring))
# RESULT = ('â\x82¬\n', 5) INSTEAD OF ("€\n", 2)

# non ASCII string : method #2
mystring = "€\\n"  # 3 characters ("€", "\", "n")
mystring = bytes(mystring, 'utf-8').decode('unicode_escape')
print(mystring)
# RESULT = â\202¬ INSTEAD OF "€\n"

Is this a bug? Am I misunderstanding something?

Any help would be greatly appreciated!

PS: I edited my post thanks to Michael Foukarakis's remark.

score 2 · Accepted Answer

I guess the actual string you need to process is mystring = "€\\n"?

mystring = "€\n"  # that's 2 char, "€" and new line
mystring = "€\\n" # that's 3 char, "€", "\" and "n"

I don't quite know what goes wrong inside encode() and decode() in Python 3, but a friend of mine solved this problem while we were writing a tool.

What we did was bypass encode("utf_8") after the unescaping step is done.

>>> "€\\n".encode("utf_8")
b'\xe2\x82\xac\\n'
>>> "€\\n".encode("utf_8").decode("unicode_escape")
'â\x82¬\n'
>>> "€\\n".encode("utf_8").decode("unicode_escape").encode("utf_8")
b'\xc3\xa2\xc2\x82\xc2\xac\n'  # we don't want this
>>> bytes([ord(char) for char in "€\\n".encode("utf_8").decode("unicode_escape")])
b'\xe2\x82\xac\n'  # what we really need
>>> str(bytes([ord(char) for char in "€\\n".encode("utf_8").decode("unicode_escape")]), "utf_8")
'€\n'

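The result of decode("unicode_escape") looks weird, but its characters actually hold the correct bytes of your string (in UTF-8 encoding), one byte value per character. In this case, b'\xe2\x82\xac\n'.

So we neither print the str object directly nor call encode("utf_8") on it. Instead, we use ord() to build the bytes object b'\xe2\x82\xac\n'.

You can then get the correct str from this bytes object: just pass it to str().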
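The ord() step can also be written as a Latin-1 round trip, because Latin-1 maps code points 0-255 to the byte with the same value. A minimal sketch follows; the helper name unescape_keeping_utf8 is made up for illustration and is not part of the standard library. It assumes the input contains no \u or \U escapes above U+00FF, which would make the Latin-1 step raise UnicodeEncodeError.

```python
def unescape_keeping_utf8(text):
    """Resolve backslash escapes without mangling non-ASCII characters.

    Hypothetical helper for illustration; not a standard library function.
    """
    # unicode_escape reads the UTF-8 bytes as Latin-1, so every byte
    # becomes one character whose code point equals the byte value.
    decoded = text.encode("utf-8").decode("unicode_escape")
    # Latin-1 turns those code points back into the same bytes,
    # which are then decoded as the UTF-8 they really are.
    return decoded.encode("latin-1").decode("utf-8")


result = unescape_keeping_utf8("€\\n")
print(repr(result))  # prints '€\n'
```

The intermediate string is never printed or encoded as UTF-8, which is what avoids the doubly encoded b'\xc3\xa2\xc2\x82\xc2\xac\n' result shown earlier.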

By the way, the tool my friend and I wanted to build is a wrapper that lets the user type C-like string literals and converts the escape sequences automatically.

User input: \n\x61\x62\n\x20\x21  # 20 characters, which represent 6 chars semantically
output:  # \n
ab       # \x61\x62\n
 !       # \x20\x21

This is a powerful tool for entering non-printable characters in a terminal.

Our final tool is:

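#!/usr/bin/env python3
import sys

for line in sys.stdin:
    # Drop the trailing newline, resolve the escapes, then map each
    # resulting code point (0-255) back to a single raw byte.
    sys.stdout.buffer.write(bytes([ord(char) for char in line[:-1].encode().decode('unicode_escape')]))
    sys.stdout.flush()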
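To try the conversion without piping anything through stdin, the same expression can be wrapped in a function. This is only a sketch: convert_line and sample are names made up here, and like the script it assumes every unescaped code point fits in one byte (0-255).

```python
def convert_line(line):
    """Turn C-style escapes in one line of text into raw bytes.

    Hypothetical helper mirroring the loop body of the script above.
    """
    unescaped = line.encode("utf-8").decode("unicode_escape")
    # Each character now stands for exactly one byte value.
    return bytes(ord(char) for char in unescaped)


sample = r"\n\x61\x62\n\x20\x21"  # the 20-character user input from above
converted = convert_line(sample)
print(len(sample), converted)  # prints: 20 b'\nab\n !'
```

An escape such as \u20ac would produce a code point above 255, and bytes() would then raise ValueError. For that case, type the character itself and let its UTF-8 bytes pass through.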

score 1 · Accepted Answer

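You seem to be misunderstanding encodings. To guard against common errors, we usually encode a string when it leaves our application and decode it when it comes in.

First, let's look at the documentation for unicode_escape, which states:

Produce a string that is suitable as Unicode literal in Python source code.

Here is what you would get from the network, or from a file that claims its contents are Unicode-escaped: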

b'\\u20ac\\n'

To use this in your app, you have to decode it:

>>> s = b'\\u20ac\\n'.decode('unicode_escape')
>>> s
'€\n'

And if you wanted to write it back to, say, a Python source file:

with open('/tmp/foo', 'wb') as fh: # binary mode
    fh.write(b'print("' + s.encode('unicode_escape') + b'")')
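The encode-on-the-way-out, decode-on-the-way-in rule can be checked with a short round trip. This is a minimal sketch; the variable names are arbitrary.

```python
text = "€\n"

# Leaving the application: escape into pure-ASCII bytes.
escaped = text.encode("unicode_escape")
print(escaped)  # prints b'\\u20ac\\n'

# Coming back in: decode the escaped bytes into a str again.
restored = escaped.decode("unicode_escape")
print(restored == text)  # prints True
```

The round trip is lossless here because the escaped form contains only ASCII bytes. The trouble in the question starts when raw UTF-8 bytes, which were never produced by the unicode_escape encoder, are fed to the decoder.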
