python - Creating a list of every word from a text file without spaces, punctuation

Question

I have a long text file (a screenplay). I want to turn this text file into a list (where every word is separated) so that I can search through it later on.

The code I have at the moment is

file = open('screenplay.txt', 'r')
words = list(file.read().split())
print words

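I think this works to split all the words into a list, but I'm having trouble removing the extra stuff, such as commas and periods at the end of words. I also want to convert capital letters to lowercase (because I want to be able to search in lowercase and have both capitalized and lowercase words show up). Any help would be fantastic :)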

score 8 · Accepted Answer

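Try the algorithm from https://stackoverflow.com/a/17951315/284795: split the text on whitespace, then trim punctuation. This carefully removes punctuation from the edges of words without harming apostrophes inside words such as we're.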

>>> text
"'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'"

>>> text.split()
["'Oh,", 'you', "can't", 'help', "that,'", 'said', 'the', 'Cat:', "'we're", 'all', 'mad', 'here.', "I'm", 'mad.', "You're", "mad.'"]

>>> import string
>>> [word.strip(string.punctuation) for word in text.split()]
['Oh', 'you', "can't", 'help', 'that', 'said', 'the', 'Cat', "we're", 'all', 'mad', 'here', "I'm", 'mad', "You're", 'mad']

You might want to add .lower()
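
Putting the two steps together, a minimal sketch might look like the following. It assumes Python 3, and the function name `words_from_text` is made up for illustration. A token that consists only of punctuation (a lone dash, say) strips down to an empty string, so the sketch drops those.

```python
import string

def words_from_text(text):
    """Split on whitespace, trim punctuation from both ends, and lowercase."""
    stripped = (token.strip(string.punctuation).lower() for token in text.split())
    # A token made only of punctuation (e.g. "--") becomes "", so skip it.
    return [word for word in stripped if word]

text = "'Oh, you can't help that,' said the Cat: 'we're all mad here.'"
print(words_from_text(text))
```

Because only the ends of each token are trimmed, apostrophes inside words such as can't and we're are left alone.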

score 4 · Accepted Answer

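A screenplay should be short enough to be read in all at once. If so, you can remove all punctuation with the translate method. Finally, you can produce the list simply by splitting on whitespace with str.split: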

import string

with open('screenplay.txt', 'rb') as f:
    content = f.read()
    content = content.translate(None, string.punctuation).lower()
    words = content.split()

print words

Note that Mr.Smith will become mrsmith. If you want it to become ['mr', 'smith'] instead, you can replace all punctuation with spaces and then use str.split:

def using_translate(content):
    table = string.maketrans(
        string.punctuation,
        ' '*len(string.punctuation))
    content = content.translate(table).lower()
    words = content.split()
    return words
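
The code above targets Python 2 (`string.maketrans` and `translate(None, ...)`). In Python 3 the same two ideas could be sketched with `str.maketrans`. The function and constant names below are hypothetical, and the sketch assumes the file contents have already been decoded to `str`.

```python
import string

# Table that deletes every ASCII punctuation character.
DELETE_PUNCT = str.maketrans("", "", string.punctuation)
# Table that maps every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def words_deleting_punct(content):
    return content.translate(DELETE_PUNCT).lower().split()

def words_spacing_punct(content):
    return content.translate(PUNCT_TO_SPACE).lower().split()

print(words_deleting_punct("Mr.Smith"))  # ['mrsmith']
print(words_spacing_punct("Mr.Smith"))   # ['mr', 'smith']
```

The space-mapping variant also splits contractions, since the apostrophe is part of `string.punctuation`: can't becomes ['can', 't'].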

One problem you may run into with a positive regex pattern such as [a-z]+ is that it matches only ASCII characters. If the file contains accented characters, words get split apart: Gruyère becomes ['Gruy', 're'].

You can fix that by using re.split to split on punctuation instead. For example,

import re

def using_re(content):
    # Split on runs of spaces, tabs, newlines, or ASCII punctuation.
    words = re.split(r"[ %s\t\n]+" % (string.punctuation,), content.lower())
    return words

However, using str.translate is faster:

In [72]: %timeit using_re(content)
100000 loops, best of 3: 9.97 us per loop

In [73]: %timeit using_translate(content)
100000 loops, best of 3: 3.05 us per loop
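
One detail of the re.split approach is worth a sketch: when the text starts or ends with a separator, re.split returns an empty string at that end. The variant below is an illustration rather than the answer's own code. It assumes Python 3, uses the made-up name `words_with_re`, swaps in `\s` for the explicit space, tab, and newline, and passes the punctuation through `re.escape` so that characters such as `]` and `\` cannot break the character class.

```python
import re
import string

# re.escape guards the characters that are special inside a character class.
SEPARATORS = re.compile(r"[\s%s]+" % re.escape(string.punctuation))

def words_with_re(content):
    # re.split returns '' at either end when the text starts or ends with a
    # separator, so filter those out.
    return [word for word in SEPARATORS.split(content.lower()) if word]

print(words_with_re("Gruyère, please."))
```

Since the pattern lists what separates words rather than what a word may contain, the accented è stays inside its word.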

score 1 · Accepted Answer

Use the replace method.

mystring = mystring.replace(",", "")

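If you want a more elegant solution that you can reuse many times over, read up on regular expressions. Most languages support them, and they are extremely useful for more complicated replacements and the like.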
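
As a rough illustration of the regular-expression route, a single `re.sub` call can remove several punctuation characters at once instead of chaining `replace` calls. The sample string and the set of characters in the pattern are only examples.

```python
import re

mystring = "Hello, world! Is this it?"
# Delete every comma, period, exclamation mark, and question mark in one pass.
cleaned = re.sub(r"[,.!?]", "", mystring)
words = cleaned.lower().split()
print(words)
```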

score 0 · Accepted Answer

I tried this code, and it works in my case:

from string import punctuation, whitespace
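s = ''
with open("path of your file", "r") as myfile:
    content = myfile.read().split()
    for word in content:
        # Skip tokens that are a substring of the punctuation or whitespace
        # string, such as a lone ",".
        if (word in punctuation) or (word in whitespace):
            pass
        else:
            # Append the lowercased token to s; no separator is added.
            s += word.lower()
print(s)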

score 0 · Accepted Answer

You could try something like this. It probably needs some more work on the regex, though.

import re
text = file.read()
words = map(lambda x: re.sub("[,.!?]", "", x).lower(), text.split())
