python - Make Python replace un-encodable characters with a string by default

Question

I want Python to ignore characters it can't encode by simply replacing them with the string "<could not encode>".

For example, assuming the default encoding is ASCII, the command

'%s is the word'%'ébác'

should yield

'<could not encode>b<could not encode>c is the word'

Is there any way to make this the default behavior across my whole project?

score 11 · Accepted Answer

The str.encode function takes an optional argument that defines the error handling:

str.encode([encoding[, errors]])

From the docs:

Return an encoded version of the string. Default encoding is the current default string encoding. errors may be given to set a different error handling scheme. The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace' and any other name registered via codecs.register_error(); see the section "Codec Base Classes". For a list of possible encodings, see the section "Standard Encodings".
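As a quick illustration of the built-in schemes named in that passage, here is a minimal sketch. It assumes Python 3, where str is a Unicode string and encode returns bytes (the thread itself dates from Python 2); the variable names are only for illustration.

```python
# "\u00e9b\u00e1c" is the question's 'ébác', written with escapes so the
# example does not depend on the source file's encoding.
text = "\u00e9b\u00e1c is the word"

# Each built-in error handler deals with the two non-ASCII characters differently.
ignored = text.encode("ascii", "ignore")              # drops them
replaced = text.encode("ascii", "replace")            # substitutes '?'
xml_refs = text.encode("ascii", "xmlcharrefreplace")  # XML character references
escaped = text.encode("ascii", "backslashreplace")    # backslash escapes

for name, value in [
    ("ignore", ignored),
    ("replace", replaced),
    ("xmlcharrefreplace", xml_refs),
    ("backslashreplace", escaped),
]:
    print(name, value)
```

None of the built-in handlers produces a custom marker such as "<could not encode>"; that requires registering a handler of your own.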

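In your case, the codecs.register_error function might be of interest.

[Note about bad characters]

By the way, note that when using register_error you will likely find yourself replacing not just individual bad characters but whole groups of consecutive bad characters with your string, unless you take care. The error handler is called once per run of bad characters, not once per character.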
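A minimal sketch of such a handler, assuming Python 3. The handler name "placeholder" and the function name are hypothetical, chosen only for this example. To cope with the one-call-per-run behavior described above, it repeats the marker once for every character in the failing range exc.start to exc.end.

```python
import codecs

PLACEHOLDER = "<could not encode>"


def placeholder_per_char(exc):
    """Error handler: emit one PLACEHOLDER per un-encodable character."""
    if not isinstance(exc, UnicodeEncodeError):
        # This sketch only covers encoding; re-raise anything else.
        raise exc
    # exc.start:exc.end may span a whole run of bad characters,
    # so repeat the marker once per character in that run.
    bad_count = exc.end - exc.start
    # Return the replacement text and the position at which to resume encoding.
    return (PLACEHOLDER * bad_count, exc.end)


# "placeholder" is an arbitrary name invented for this example.
codecs.register_error("placeholder", placeholder_per_char)

# "\u00e9b\u00e1c" is the question's 'ébác', written with escapes.
single = "\u00e9b\u00e1c is the word".encode("ascii", "placeholder")
run = "\u00e9\u00e1b".encode("ascii", "placeholder")  # two bad characters in a row
print(single)
print(run)
```

The registration is process-wide, but the handler still has to be requested by name in each encode call; this sketch does not change what plain string formatting or an encode call without an errors argument does.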

score 5 · Accepted Answer

>>> help("".encode)
Help on built-in function encode:

encode(...)
S.encode([encoding[,errors]]) -> object

Encodes S using the codec registered for encoding. encoding defaults
to the default encoding. errors may be given to set a different error
handling scheme. Default is 'strict' meaning that encoding errors raise
a UnicodeEncodeError. **Other possible values are** 'ignore', **'replace'** and
'xmlcharrefreplace' as well as any other name registered with
codecs.register_error that is able to handle UnicodeEncodeErrors.

So, for example:

>>> x
'\xc3\xa9b\xc3\xa1c is the word'
>>> x.decode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>> x.decode("ascii", "replace")
u'\ufffd\ufffdb\ufffd\ufffdc is the word'

Add your own callback with codecs.register_error to replace with the string of your choice.
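For the decoding direction shown in the session above, a callback could look roughly like the following. This is a sketch assuming Python 3, where the input is a bytes object; the handler name "decode_placeholder", the function name, and the marker text are all hypothetical.

```python
import codecs

DECODE_PLACEHOLDER = "<could not decode>"


def placeholder_on_decode(exc):
    """Error handler: emit one marker per undecodable byte."""
    if not isinstance(exc, UnicodeDecodeError):
        # This sketch only covers decoding; re-raise anything else.
        raise exc
    # Repeat the marker for every byte in the failing range.
    bad_count = exc.end - exc.start
    # Return the replacement text and the position at which to resume decoding.
    return (DECODE_PLACEHOLDER * bad_count, exc.end)


# "decode_placeholder" is an arbitrary name invented for this example.
codecs.register_error("decode_placeholder", placeholder_on_decode)

# The same bytes as x in the session above (UTF-8 encoded text).
raw = b"\xc3\xa9b\xc3\xa1c is the word"
decoded = raw.decode("ascii", "decode_placeholder")
print(decoded)
```

Because the input is UTF-8, each accented letter occupies two bytes, so each one produces two markers here, just as the built-in 'replace' handler produced two u'\ufffd' characters per letter.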
