python - How to efficiently find the occurrences of a list of letters using NLTK in Python?

Question

I can read a text corpus with NLTK in Python 2.6:

from nltk.corpus import gutenberg
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid)) 
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set([w.lower() for w in gutenberg.words(fileid)]))
    print int(num_chars/num_words), int(num_words/num_sents), int(num_words/num_vocab), fileid

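Now I would like to find the average number of occurrences of letters per word and per sentence, with something like num_letters(whole_text, ['a', 'bb', 'ccc']). The expected output is:

a = n11/n12, bb = n21/n22, ccc = n31/n32

where n11 = occurrences in words and n12 = occurrences in sentences.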

score 2 · Accepted Answer

You could do this with regular expressions: for each element you want to match, find all of its matches in the full text.

import re
matches = ['a', 'bb', 'ccc', 'and']

#add this line into your for loop:
    num_letter_dict = dict([(match, len([seq.start() for seq in 
            re.finditer(match, gutenberg.raw(fileid))])) for match in matches])

This builds a dictionary of every match and its frequency. For the first text, austen-emma.txt, num_letter_dict looks like this:

{'a': 53669, 'and': 5257, 'ccc': 0, 'bb': 52}
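The counting step can also be tried on its own, without the corpus. Below is a minimal sketch that assumes Python 3 and uses a hypothetical helper name, count_occurrences, which is not part of NLTK or of the snippet above. It also assumes the elements should be matched literally, so it passes each one through re.escape; the original line hands them to re.finditer as regular expressions. Matches are non-overlapping, so 'bb' is found twice in 'bbbb'.

```python
import re


def count_occurrences(text, patterns):
    """Count non-overlapping literal occurrences of each pattern in text.

    Hypothetical helper for illustration only.
    """
    counts = {}
    for pattern in patterns:
        # re.escape makes characters such as '.' or '?' match literally
        counts[pattern] = len(re.findall(re.escape(pattern), text))
    return counts


sample = 'apples pairs bannanas'
print(count_occurrences(sample, ['a', 'n', 'p']))
# {'a': 5, 'n': 3, 'p': 3}
```

Inside the loop from the question, the same helper could presumably be called as count_occurrences(gutenberg.raw(fileid), matches).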

To go from here to the average number of occurrences per word and per sentence, just divide by num_words and num_sents respectively.
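As a rough sketch of that division, assuming Python 3: the helper name average_per_unit and the toy numbers are made up for illustration and do not come from the corpus. The explicit float conversion matters on Python 2, where dividing two integers truncates the result.

```python
def average_per_unit(counts, num_units):
    """Divide every count by num_units and return the averages as floats.

    Hypothetical helper for illustration only.
    """
    if num_units == 0:
        raise ValueError('num_units must be non-zero')
    # float() keeps the division exact even where int / int would truncate
    return {key: count / float(num_units) for key, count in counts.items()}


# Toy values, not taken from any real text
toy_counts = {'a': 6, 'bb': 1}
print(average_per_unit(toy_counts, 4))
# {'a': 1.5, 'bb': 0.25}
```

Calling it once with num_words and once with num_sents would give the per-word and per-sentence averages.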

To find the number of words that contain these elements (repeats within a word are not counted), use:

num_letter_in_words = dict([(match, len([word for word in gutenberg.words(fileid)
                                      if match in word])) for match in matches])
#from the same text gives:
{'a': 50043, 'and': 5257, 'ccc': 0, 'bb': 52}

As an example:

text = 'apples pairs bannanas'
matches = ['a', 'n', 'p']
#gives:
{'a': 3, 'p': 2, 'n': 1}
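The small example above can be reproduced with a self-contained sketch, again assuming Python 3. The helper name words_containing is hypothetical, and the text is split on whitespace here as a stand-in for NLTK's tokenizer.

```python
def words_containing(words, patterns):
    """For each pattern, count the words that contain it at least once.

    Hypothetical helper for illustration only.
    """
    return {
        pattern: sum(1 for word in words if pattern in word)
        for pattern in patterns
    }


text = 'apples pairs bannanas'
# Whitespace split as a simple stand-in for a real tokenizer
print(words_containing(text.split(), ['a', 'n', 'p']))
# {'a': 3, 'n': 1, 'p': 2}
```

With the corpus, gutenberg.words(fileid) would take the place of text.split().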
