python - Remove punctuation from Unicode formatted strings

Question

I have a function that removes punctuation from a list of strings:

import re

def strip_punctuation(input):
    x = 0
    for word in input:
        input[x] = re.sub(r'[^A-Za-z0-9 ]', "", input[x])
        x += 1
    return input

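I recently modified my script to use Unicode strings so that it can handle non-Western characters. The function breaks when it encounters these characters and just returns empty Unicode strings. How can I reliably remove punctuation from Unicode formatted strings?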
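
For illustration, the behaviour described above can be reproduced in a few lines. This is a minimal sketch assuming Python 3, where every str is already Unicode; the sample string and the name ascii_only are made up for the example.

```python
import re

# The character class only whitelists ASCII letters, digits and spaces,
# so every non-ASCII letter is removed along with the punctuation.
ascii_only = re.compile(r'[^A-Za-z0-9 ]')

sample = "привет, мир!"   # hypothetical sample input
stripped = ascii_only.sub("", sample)
print(repr(stripped))     # only the space between the two words survives
```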

score 76 · Accepted Answer

You could use the unicode.translate() method:

import unicodedata
import sys

tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
                      if unicodedata.category(unichr(i)).startswith('P'))
def remove_punctuation(text):
    return text.translate(tbl)

You could also use r'\p{P}', which is supported by the regex module:

import regex as re

def remove_punctuation(text):
    return re.sub(ur"\p{P}+", "", text)
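
If installing the third-party regex module is not an option, a rough approximation is possible with the standard re module. This is a sketch assuming Python 3, where \w matches Unicode word characters by default. It is a different technique rather than a drop-in replacement for \p{P}: it also removes symbols such as $, keeps the underscore, and may drop combining marks used by some scripts. The function name is hypothetical.

```python
import re

# Matches runs of characters that are neither word characters nor whitespace.
# Unlike \p{P}, this also removes symbols (e.g. "$") and keeps "_",
# because \w includes the underscore.
_non_word = re.compile(r"[^\w\s]+")

def remove_non_word_chars(text):
    return _non_word.sub("", text)

print(remove_non_word_chars("¿Qué tal, señor?"))
```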

score 27 · Accepted Answer

If you want to use J.F. Sebastian's solution in Python 3:

import unicodedata
import sys

tbl = dict.fromkeys(i for i in range(sys.maxunicode)
                      if unicodedata.category(chr(i)).startswith('P'))
def remove_punctuation(text):
    return text.translate(tbl)
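
As a usage sketch, the table can be applied to each string in a list, which is what the function in the question does. The code below assumes Python 3 and rebuilds the table so that the block is self-contained; the sample strings and the helper name strip_all are made up for illustration.

```python
import sys
import unicodedata

# Map every code point in a punctuation category (P*) to None,
# which tells str.translate() to delete that character.
tbl = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith('P')
)

def remove_punctuation(text):
    return text.translate(tbl)

def strip_all(words):
    # Return a new list instead of mutating the input in place.
    return [remove_punctuation(w) for w in words]

print(strip_all(["привет, мир!", "«quoted»", "日本語、テスト。"]))
```

Building the table scans every code point once, so it is probably best done a single time at module level and reused across calls.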

score 9 · Accepted Answer

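You can iterate over the string and use the category function from the unicodedata module to determine whether each character is punctuation.

For the possible outputs of category, see unicode.org's documentation on General Category Values.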

from unicodedata import category as cat

def strip_punctuation(word):
    # Keep only the characters that are NOT in a punctuation category (P*).
    return "".join(char for char in word if not cat(char).startswith('P'))

filtered = [strip_punctuation(word) for word in input]
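
To get a feel for the values that category returns, the sketch below prints the general category of a few characters. It assumes Python 3; the sample characters and the helper is_punctuation are chosen for illustration. Categories starting with P are punctuation, while symbols such as $ fall under S and are not caught by a check on P.

```python
import unicodedata

samples = [",", "-", "(", ")", "_", "«", "»", "a", "$"]
for ch in samples:
    # Prints the character next to its two-letter general category.
    print(repr(ch), unicodedata.category(ch))

def is_punctuation(ch):
    # True for any of the P* categories (Pc, Pd, Ps, Pe, Pi, Pf, Po).
    return unicodedata.category(ch).startswith("P")
```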

Additionally, make sure that you are handling encodings and types correctly. This presentation is a good place to start: http://bit.ly/unipain
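
One way to read the advice about encodings is to decode bytes into text at the boundary and only then strip punctuation. A minimal sketch, assuming Python 3 and UTF-8 input; the function name and the default encoding are assumptions, not part of the answer.

```python
import unicodedata

def strip_punctuation_from_bytes(raw, encoding="utf-8"):
    # Decode first: category() works on text (str), not on raw bytes.
    text = raw.decode(encoding)
    return "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))

# Hypothetical input, standing in for bytes read from a file or socket.
raw = "¡Hola, mundo!".encode("utf-8")
print(strip_punctuation_from_bytes(raw))
```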

score 8 · Accepted Answer

A slightly shorter version based on Daenyth's answer:

import unicodedata

def strip_punctuation(text):
    """
    >>> strip_punctuation(u'something')
    u'something'

    >>> strip_punctuation(u'something.,:else really')
    u'somethingelse really'
    """
    punctuation_cats = set(['Pc', 'Pd', 'Ps', 'Pe', 'Pi', 'Pf', 'Po'])
    return ''.join(x for x in text
                   if unicodedata.category(x) not in punctuation_cats)

input_data = [u'something', u'something, else', u'nothing.']
without_punctuation = map(strip_punctuation, input_data)
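
Under Python 3, map returns a lazy iterator and string literals no longer need the u prefix, so the usage might look like the following sketch. The constant name PUNCTUATION_CATS and the use of frozenset are choices made for this example.

```python
import unicodedata

# The seven general categories whose names start with "P".
PUNCTUATION_CATS = frozenset(['Pc', 'Pd', 'Ps', 'Pe', 'Pi', 'Pf', 'Po'])

def strip_punctuation(text):
    return ''.join(ch for ch in text
                   if unicodedata.category(ch) not in PUNCTUATION_CATS)

input_data = ['something', 'something, else', 'nothing.']
# Wrap map() in list() to materialise the results in Python 3.
without_punctuation = list(map(strip_punctuation, input_data))
print(without_punctuation)
```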
