python - Normalizing Unicode text to filenames, etc. in Python

Question

Is there a standalone solution in Python for normalizing international Unicode text to safe IDs and filenames?

For example, turn My International Text: åäö into my-international-text-aao

plone.i18n does a really good job, but unfortunately it depends on zope.security, zope.publisher, and several other packages, which makes it a fragile dependency.

score 35 · Accepted Answer

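What you want to do is also known as "slugifying" a string. Here is a possible solution (the code uses the unicode built-in, so it targets Python 2):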

import re
from unicodedata import normalize

_punct_re = re.compile(r'[\t !"#$%&\'()*\-/<=>?@\[\\\]^_`{|},.:]+')

def slugify(text, delim=u'-'):
    """Generates a slightly worse ASCII-only slug."""
    result = []
    for word in _punct_re.split(text.lower()):
        word = normalize('NFKD', word).encode('ascii', 'ignore')
        if word:
            result.append(word)
    return unicode(delim.join(result))

Usage:

>>> slugify(u'My International Text: åäö')
u'my-international-text-aao'

You can also change the delimiter:

>>> slugify(u'My International Text: åäö', delim='_')
u'my_international_text_aao'

Source: Generating Slugs

For Python 3: pastebin.com/ft7Yb3KS (thanks to @MrPoxipol).
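For readers who cannot reach the link, here is a minimal sketch of how the function above might be ported to Python 3. It is not the content of the pastebin; it only assumes that the bytes returned by encode() need to be decoded back to str and that the unicode() call can be dropped.

```python
import re
from unicodedata import normalize

# Same punctuation/whitespace pattern as in the answer above.
_punct_re = re.compile(r'[\t !"#$%&\'()*\-/<=>?@\[\\\]^_`{|},.:]+')

def slugify(text, delim='-'):
    """Generate an ASCII-only slug (Python 3 sketch of the function above)."""
    result = []
    for word in _punct_re.split(text.lower()):
        # NFKD splits accented letters into base letter + combining mark;
        # encoding to ASCII with 'ignore' then drops the combining marks.
        word = normalize('NFKD', word).encode('ascii', 'ignore').decode('ascii')
        if word:
            result.append(word)
    return delim.join(result)

print(slugify('My International Text: \u00e5\u00e4\u00f6'))
# my-international-text-aao
```

Characters with no ASCII decomposition (for example CJK text) are dropped entirely by this approach, so a slug can come out empty.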

score 4 · Accepted Answer

The way to solve this problem is to decide which characters are allowed (different systems have different rules for valid identifiers).

Once you have decided which characters are allowed, write an allowed() predicate and a dict subclass for use with str.translate:

def makesafe(text, allowed, substitute=None):
    ''' Remove unallowed characters from text.
        If *substitute* is defined, then replace
        the character with the given substitute.
    '''
    class D(dict):
        def __getitem__(self, key):
            return key if allowed(chr(key)) else substitute
    return text.translate(D())

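This function is very flexible. It lets you easily specify rules for deciding which text is kept and which text is either replaced or removed.

Here is a simple example using the rule "only allow characters that are in Unicode category L":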

import unicodedata

def allowed(character):
    return unicodedata.category(character).startswith('L')

print(makesafe('the*ides&of*march', allowed, '_'))
print(makesafe('the*ides&of*march', allowed))

That code produces safe output as follows:

the_ides_of_march
theidesofmarch
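The same makesafe() function can be reused with a different rule. As a sketch, the hypothetical predicate allowed_in_filename below (a name and rule invented for illustration, not part of the answer) keeps letters, digits, and the characters "-", "_" and ".", and replaces everything else:

```python
import unicodedata

def makesafe(text, allowed, substitute=None):
    """Same function as above, repeated so the example is self-contained."""
    class D(dict):
        def __getitem__(self, key):
            # str.translate looks up each code point; returning None deletes it.
            return key if allowed(chr(key)) else substitute
    return text.translate(D())

def allowed_in_filename(character):
    # Hypothetical rule: Unicode letters (L*) and numbers (N*),
    # plus a few punctuation characters often accepted in file names.
    return unicodedata.category(character)[0] in 'LN' or character in '-_.'

print(makesafe('report 2024: final*.txt', allowed_in_filename, '_'))
# report_2024__final_.txt
```

Which characters are actually safe depends on the target system, so the rule should be adjusted accordingly.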

score 2 · Accepted Answer

I'll throw my own (partial) solution in here too:

import unicodedata

def deaccent(some_unicode_string):
    return u''.join(c for c in unicodedata.normalize('NFD', some_unicode_string)
               if unicodedata.category(c) != 'Mn')

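This doesn't do everything you want, but it offers a few nice tricks wrapped up in a convenience function. unicodedata.normalize('NFD', some_unicode_string) decomposes Unicode characters; for example, it splits 'ä' into the two Unicode code points U+0061 and U+0308.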

The other function, unicodedata.category(char), returns the Unicode character category of that particular char. Category Mn (nonspacing marks) contains the combining accents, so deaccent removes all accents from the words.

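Note, however, that this is only a partial solution: it only gets rid of the accents. After this you still need some kind of whitelist of the characters you want to allow.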
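To make the decomposition step concrete, the following sketch prints the code points and categories that NFD produces for 'ä', and shows a character ('ø') that has no decomposition and therefore passes through deaccent unchanged:

```python
import unicodedata

def deaccent(some_unicode_string):
    """Same function as above, repeated so the example is self-contained."""
    return ''.join(c for c in unicodedata.normalize('NFD', some_unicode_string)
                   if unicodedata.category(c) != 'Mn')

decomposed = unicodedata.normalize('NFD', '\u00e4')  # '\u00e4' is 'ä'
print(['U+%04X' % ord(c) for c in decomposed])
# ['U+0061', 'U+0308']
print([unicodedata.category(c) for c in decomposed])
# ['Ll', 'Mn']

print(deaccent('\u00e5\u00e4\u00f6'))  # 'åäö' becomes 'aao'
print(deaccent('\u00f8'))              # 'ø' is not decomposable and is kept as is
```

The last line is why a whitelist is still needed afterwards: non-ASCII letters that do not decompose survive this step.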

score 2 · Accepted Answer

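The following example strips accents from characters that Unicode can decompose into combining pairs, discards any odd characters that cannot be decomposed, and replaces whitespace with underscores.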

# encoding: utf-8
from unicodedata import normalize
import re

original = u'ľ š č ť ž ý á í é'
decomposed = normalize("NFKD", original)
no_accent = ''.join(c for c in decomposed if ord(c)<0x7f)
no_spaces = re.sub(r'\s', '_', no_accent)

print(no_spaces)
# output: l_s_c_t_z_y_a_i_e

It doesn't try to get rid of characters that are disallowed on filesystems, but you can steal DANGEROUS_CHARS_REGEX from the file you linked for that.
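As a rough sketch of how such a filter could be added to the steps above: the pattern UNSAFE_CHARS_RE and the function to_filename below are hypothetical names made up for this illustration. The pattern is not the DANGEROUS_CHARS_REGEX from plone.i18n; it is an assumed list of characters that are commonly rejected in file names, plus ASCII control characters.

```python
import re
from unicodedata import normalize

# Hypothetical pattern (an assumption, not taken from plone.i18n):
# characters commonly rejected in file names, plus ASCII control characters.
UNSAFE_CHARS_RE = re.compile(r'[<>:"/\\|?*\x00-\x1f]+')

def to_filename(text, replacement='_'):
    decomposed = normalize('NFKD', text)
    # Keep only ASCII; this drops combining accents and undecomposable characters.
    ascii_only = ''.join(c for c in decomposed if ord(c) < 0x7f)
    # Replace each run of unsafe characters, then each run of whitespace.
    no_unsafe = UNSAFE_CHARS_RE.sub(replacement, ascii_only)
    return re.sub(r'\s+', replacement, no_unsafe)

print(to_filename('\u017e: \u00e1/\u00e9?'))  # input is 'ž: á/é?'
# z__a_e_
```

The exact set of forbidden characters differs between filesystems, so the pattern should be checked against the platform in use.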
