python - Stripping everything except letters from Unicode text

Question

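I'm trying to write a simple Python script that takes a text file as input, deletes every non-letter character, and writes the output to another file. Normally I would do it in one of two ways:

1. use a regular expression together with re.sub to replace every non-letter character with an empty string
2. go through every character of every line and write it to the output only if it is in string.lowercase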

But this time the text is the Divine Comedy in Italian (I'm Italian), so it contains Unicode characters such as

èéï

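and a few others. I wrote # -*- coding: utf-8 -*- as the first line of the script, but as far as I understand, that only means Python does not raise an error when Unicode characters are written inside the script itself.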

Then I tried to include Unicode characters in my regular expression, writing them like this, for example:

u'\u00AB'

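That seems to work, but when Python reads the input from the file it does not write back what it read in the same form. For example, some characters are turned into a square-root symbol.

What should I do?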

score 2 · Accepted Answer

unicodedata.category(unichr) returns the category of that code point.

You can find descriptions of the categories at unicode.org, but the ones relevant to you are the L, N, P, Z, and possibly S groups.

Lu    Uppercase_Letter    an uppercase letter
Ll    Lowercase_Letter    a lowercase letter
Lt    Titlecase_Letter    a digraphic character, with first part uppercase
Lm    Modifier_Letter     a modifier letter
Lo    Other_Letter    other letters, including syllables and ideographs
...
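To make the category codes concrete, here is a small sketch (assuming Python 3, where every str is Unicode) that prints the category of a few characters. The sample characters are chosen for illustration and do not come from the original post.

```python
import unicodedata

# Illustrative sample characters (an assumption, not taken from the post).
samples = ['a', 'E', '\u00e8', '\u00ab', '7', ' ', '\n', '\u221a']

# category() returns a two-letter code; the first letter is the major group.
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch in samples:
    print(repr(ch), categories[ch])
```

Note that the line break falls in the C (control) group, which is not among the groups listed above.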

You might also want to normalize your string first so that diacriticals that can attach to letters do so:

unicodedata.normalize(form, unistr)

Return the normal form form for the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’.
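Why normalization matters for a category filter can be seen in a short sketch (again assuming Python 3). The decomposed string below is an illustrative input, not one from the post: an accented letter stored as a base letter plus a combining mark has two code points, and the mark is not in the letter group.

```python
import unicodedata

# 'e' followed by U+0300 COMBINING GRAVE ACCENT (illustrative input).
decomposed = 'e\u0300'
composed = unicodedata.normalize('NFC', decomposed)

# Before NFC: two code points, a letter (Ll) and a combining mark (Mn).
print(len(decomposed), [unicodedata.category(c) for c in decomposed])
# After NFC: a single precomposed letter.
print(len(composed), [unicodedata.category(c) for c in composed])
```

Without the NFC step, a filter that keeps only L categories would drop the accent and leave a bare "e".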

Putting all this together:

import unicodedata

file_bytes = ...   # However you read your input
file_text = file_bytes.decode('UTF-8')
normalized_text = unicodedata.normalize('NFC', file_text)
# Line breaks are category 'Cc', which is not listed here, so they are removed too.
allowed_categories = set([
    'Ll', 'Lu', 'Lt', 'Lm', 'Lo',  # Letters
    'Nd', 'Nl',                    # Digits
    'Po', 'Ps', 'Pe', 'Pi', 'Pf',  # Punctuation
    'Zs'                           # Space separators
])
filtered_text = ''.join(
    [ch for ch in normalized_text
     if unicodedata.category(ch) in allowed_categories])
filtered_bytes = filtered_text.encode('UTF-8')  # ready to be written to a file
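If the goal is strictly "letters only", as in the question, the same idea can be narrowed to the L group. The following is a minimal sketch assuming Python 3; keep_letters and its keep_whitespace flag are hypothetical names introduced here, and the sample string is made up for illustration.

```python
import unicodedata

def keep_letters(text, keep_whitespace=True):
    """Hypothetical helper: drop every character that is not a letter."""
    normalized = unicodedata.normalize('NFC', text)
    kept = []
    for ch in normalized:
        if unicodedata.category(ch).startswith('L'):
            kept.append(ch)   # any letter category: Lu, Ll, Lt, Lm, Lo
        elif keep_whitespace and ch.isspace():
            kept.append(ch)   # optionally preserve spaces and line breaks
    return ''.join(kept)

# Illustrative input, not from the original post.
sample = 'ch\u00e9 la via, 1 \u00abera\u00bb!'
with_spaces = keep_letters(sample)
letters_only = keep_letters(sample, keep_whitespace=False)
print(repr(with_spaces))
print(repr(letters_only))
```

Keeping whitespace is a design choice: without it, words run together, which may or may not be what the output file should contain.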

score 0 · Accepted Answer

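import codecs

# Open the file with an explicit encoding so each line is decoded to Unicode.
f = codecs.open('FILENAME', encoding='utf-8')
for line in f:
    print(repr(line))  # escaped representation of the decoded line
    print(line)        # the line as it is written in the file
f.close()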

1. The first print gives you the Unicode representation of the line.
2. The second print gives you the line as it is written in your file.

Hopefully that helps you :)
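One possible cause of the square-root symbols, which is an assumption and not something the post confirms, is that UTF-8 bytes are being decoded or displayed with a different codec. The sketch below (assuming Python 3; the sample word and temporary file are illustrative) writes and reads a file with the same explicit encoding, then decodes the same bytes as Mac OS Roman to reproduce that kind of garbling.

```python
import io
import os
import tempfile

text = 'perch\u00e8'  # illustrative sample word, not from the post

# Write and read back using the same explicit encoding on both sides.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(text)
with io.open(path, 'r', encoding='utf-8') as f:
    round_tripped = f.read()

# Decode the same bytes with a mismatched codec to see the garbling.
with io.open(path, 'rb') as f:
    raw = f.read()
garbled = raw.decode('mac_roman')

print(repr(round_tripped))
print(repr(garbled))
```

When the encoding matches on read and write, the text comes back unchanged; with the mismatched codec, the two bytes of the accented letter turn into a square-root sign followed by another symbol.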
