python - Unicode Normalization

Question

Is there a normalization path that brings both strings below to the same value?

u'Aho\xe2\u20ac\u201cCorasick_string_matching_algorithm'
u'Aho\u2013Corasick string matching algorithm'

score 1 · Accepted Answer

You appear to have Mojibake: UTF-8 bytes that were decoded as if they were Windows-1252 data. Encoded back to Windows-1252, those 3 "characters" produce exactly the 3 UTF-8 bytes of the U+2013 EN DASH character in your target string:

>>> u'\u2013'.encode('utf8')
'\xe2\x80\x93'
>>> u'\u2013'.encode('utf8').decode('windows-1252')
u'\xe2\u20ac\u201c'

You can use the ftfy module to repair that data, so you get the en dash back in place of the mis-decoded bytes:

>>> import ftfy
>>> sample = u'Aho\xe2\u20ac\u201cCorasick_string_matching_algorithm'
>>> ftfy.fix_text(sample)
u'Aho\u2013Corasick_string_matching_algorithm'

Then simply replace the underscores with spaces:

>>> ftfy.fix_text(sample).replace('_', ' ')
u'Aho\u2013Corasick string matching algorithm'

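You could also simply encode to Windows-1252 and decode again as UTF-8, but that doesn't always work, because there are specific bytes that cannot legitimately be decoded as Windows-1252. ftfy includes specialised repair codecs to reverse that process. It also detects which specific Mojibake error was made, so the repair can be automated across multiple possible codec errors.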
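
For illustration, the manual round trip could look roughly like the sketch below. It is written for Python 3 (the sessions above are Python 2), the helper name `fix_cp1252_mojibake` is made up for this example, and it assumes the only damage is UTF-8 read as Windows-1252. When the round trip is not possible it returns the input unchanged.

```python
def fix_cp1252_mojibake(text):
    """Try to undo UTF-8 bytes that were wrongly decoded as Windows-1252.

    Hypothetical helper for illustration only; returns the input unchanged
    when the round trip fails.
    """
    try:
        # Recover the original bytes, then decode them the way they
        # should have been decoded in the first place.
        return text.encode('windows-1252').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text contains characters that Windows-1252 cannot
        # represent, or the recovered bytes are not valid UTF-8.
        return text


sample = 'Aho\xe2\u20ac\u201cCorasick_string_matching_algorithm'
fixed = fix_cp1252_mojibake(sample).replace('_', ' ')
print(ascii(fixed))

# U+00C1 is b'\xc3\x81' in UTF-8, and byte 0x81 is undefined in Python's
# Windows-1252 codec. If the mangled text carries it as U+0081, the
# encode step fails and the text comes back untouched.
stuck = fix_cp1252_mojibake('\u00c3\x81')
print(ascii(stuck))
```

The second call shows the limitation described above: the naive round trip gives up on input like this, which is the kind of case the answer says ftfy's repair codecs are meant to handle.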
