python - Strip punctuation from unique strings in an input file

Question

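This question (Best way to strip punctuation from a string in Python) deals with stripping punctuation from an individual string. However, I want to read text from an input file and print only one copy of each string, without its trailing punctuation. I started with something like this: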

f = open('#file name ...', 'a+')
for x in set(f.read().split()):
    print x

But the problem is that if the input file contains, for example, this line:

This is not is, clearly is: weird

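it treats the three occurrences of "is" ("is", "is," and "is:") as three different strings. I want to ignore the punctuation and print "is" only once, not three times. How can I strip any kind of trailing punctuation and put the resulting strings in a set?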

Thanks for any help. (I'm really new to Python.)

score 1 · Accepted Answer

import re

for x in set(re.findall(r'\b\w+\b', f.read())):
    print(x)

This should separate the words more accurately.

This regular expression finds contiguous runs of word characters (a-z, A-Z, 0-9, _).

If you want to match letters only (no digits or underscores), replace \w with [a-zA-Z].
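To see what that substitution changes, here is a small sketch (Python 3; the sample string and variable names are made up for illustration). One thing that may be surprising: because \b only matches between a word character and a non-word character, a token that mixes letters with digits or underscores is skipped entirely by the letters-only pattern rather than being trimmed.

```python
import re

# Hypothetical sample text containing tokens with an underscore and a digit.
text = "is, item_2 is: x9"

# \w+ keeps digits and underscores as part of a word.
words_any = re.findall(r'\b\w+\b', text)
print(words_any)      # ['is', 'item_2', 'is', 'x9']

# Literal replacement of \w with [a-zA-Z], keeping the \b anchors.
# 'item_2' and 'x9' produce no match at all: there is no word boundary
# between a letter and the following '_' or digit.
words_letters = re.findall(r'\b[a-zA-Z]+\b', text)
print(words_letters)  # ['is', 'is']
```

If you would rather keep the letter part of such tokens ("item", "x"), dropping the \b anchors and using just [a-zA-Z]+ is one option.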

>>> re.findall(r'\b\w+\b', "This is not is, clearly is: weird")
['This', 'is', 'not', 'is', 'clearly', 'is', 'weird']
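Putting this together with the file reading from the question, a minimal sketch could look like the following. It is written for Python 3, and the function names `unique_words` and `unique_words_in_file` are my own, not anything from the standard library. It also opens the file for reading, on the assumption that the 'a+' mode in the question was not intentional.

```python
import re

def unique_words(text):
    """Return the set of distinct word tokens (letters, digits, '_') in text."""
    return set(re.findall(r'\b\w+\b', text))

def unique_words_in_file(path):
    """Read the whole file at `path` and return its distinct word tokens."""
    # 'r' (read) mode; the context manager closes the file afterwards.
    with open(path, 'r') as f:
        return unique_words(f.read())
```

You would then loop over `unique_words_in_file(...)` with your own file name and print each item. Sets have no defined order, so wrap the result in `sorted()` if you want stable output.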

score 0 · Accepted Answer

If you don't mind replacing the punctuation with whitespace, for example, you can use a translation table.

>>> from string import maketrans
>>> punctuation = ",;.:"
>>> replacement = "    "
>>> trans_table = maketrans(punctuation, replacement)
>>> 'This is not is, clearly is: weird'.translate(trans_table)
'This is not is  clearly is  weird'
# And for your case of creating a set of unique words.
>>> set('This is not is  clearly is  weird'.split())
set(['This', 'not', 'is', 'clearly', 'weird'])
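Note that `from string import maketrans` only works on Python 2. On Python 3 the counterpart is `str.maketrans`. Below is a rough Python 3 sketch of the same idea that covers everything in `string.punctuation` instead of just four characters, next to a `str.strip` variant that only removes leading and trailing punctuation, which is closer to what the question literally asks for. Both function names are made up for this example.

```python
import string

# Map every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation,
                                ' ' * len(string.punctuation))

def unique_by_translate(text):
    """Replace all punctuation with spaces, then split into unique words."""
    return set(text.translate(_PUNCT_TO_SPACE).split())

def unique_by_strip(text):
    """Strip punctuation only from the ends of each whitespace-separated token."""
    stripped = (token.strip(string.punctuation) for token in text.split())
    # Tokens made up entirely of punctuation become '' and are dropped.
    return {word for word in stripped if word}

line = "This is not is, clearly is: weird"
print(unique_by_translate(line) == unique_by_strip(line))  # True
```

The two differ on punctuation inside a word: the translate version turns "don't" into "don" and "t", while the strip version leaves "don't" intact. Which one you want depends on your input.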
