python - How can I count words in an NLTK PlaintextCorpus faster?

Question

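I have a set of documents, and I want to return a list of tuples where each tuple holds the date of a given document and the number of times a given search term appears in that document. My code (below) works, but it is slow, and I'm a n00b. Is there an obvious way to make this faster? Any help would be much appreciated, mostly so that I can learn to code better, but also so that I can finish this project sooner!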

import nltk
from nltk.corpus import PlaintextCorpusReader

def searchText(searchword):
    counts = []
    corpus_root = 'some_dir'
    wordlists = PlaintextCorpusReader(corpus_root, '.*')
    for id in wordlists.fileids():
        # characters 4..11 of the file id hold the date as YYYYMMDD
        date = id[4:12]
        month = date[-4:-2]
        day = date[-2:]
        year = date[:4]
        raw = wordlists.raw(id)
        tokens = nltk.word_tokenize(raw)
        text = nltk.Text(tokens)
        count = text.count(searchword)
        counts.append((month, day, year, count))

    return counts

score 8 · Accepted Answer

If all you need is word frequency counts, you don't need to create nltk.Text objects, or even use PlaintextCorpusReader. Instead, go straight to nltk.FreqDist.

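import nltk

files = list_of_files  # placeholder: paths of the documents to count
fd = nltk.FreqDist()
for file in files:
    with open(file) as f:
        # read the contents first; a file object has no .lower() method
        for sent in nltk.sent_tokenize(f.read().lower()):
            for word in nltk.word_tokenize(sent):
                fd[word] += 1  # FreqDist.inc() was removed in NLTK 3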

Or, if you don't want to do any analysis, just use a dict.

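import nltk

files = list_of_files  # placeholder: paths of the documents to count
fd = {}
for file in files:
    with open(file) as f:
        # read the contents first; a file object has no .lower() method
        for sent in nltk.sent_tokenize(f.read().lower()):
            for word in nltk.word_tokenize(sent):
                try:
                    fd[word] = fd[word]+1
                except KeyError:
                    fd[word] = 1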
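As a side note, the try/except bookkeeping above is what collections.Counter from the standard library does for you. The following is only a sketch, not the answer's own code: count_words and simple_tokenize are hypothetical names, and the regex tokenizer is a stand-in so the example runs without NLTK data. Passing nltk.word_tokenize as the tokenize argument should give tokenization closer to the answer's.

```python
import re
from collections import Counter

def simple_tokenize(text):
    # Hypothetical stand-in tokenizer (runs of word characters);
    # swap in nltk.word_tokenize to match the answer's tokenization.
    return re.findall(r"\w+", text)

def count_words(paths, tokenize=simple_tokenize):
    """Return a Counter of lower-cased tokens across all files in paths."""
    counts = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            # Counter.update adds one for every token in the iterable
            counts.update(tokenize(f.read().lower()))
    return counts
```

Looking up a word that never occurred, for example count_words(files)["missing"], returns 0 instead of raising KeyError, which is the same convenience FreqDist gives.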

These can be made much more efficient using generator expressions, but I've used for loops for readability.
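One way to read the generator suggestion, applied to the per-document tuples the question asks for, is sketched below. It is an illustration under assumptions rather than the answer's method: search_counts and simple_tokenize are hypothetical names, the files are assumed to sit directly in one directory with the date at characters 4 to 11 of the file name (the same slice the question uses), and matching is case-insensitive, which the question's code is not. The regex tokenizer is a stand-in; nltk.word_tokenize can be passed instead.

```python
import os
import re

def simple_tokenize(text):
    # Hypothetical stand-in tokenizer; pass nltk.word_tokenize to use NLTK.
    return re.findall(r"\w+", text)

def search_counts(paths, searchword, tokenize=simple_tokenize):
    """Yield (month, day, year, count) for each file, one file at a time."""
    target = searchword.lower()
    for path in paths:
        # Assumption: file names embed the date as YYYYMMDD at [4:12],
        # mirroring id[4:12] in the question.
        date = os.path.basename(path)[4:12]
        with open(path, encoding="utf-8") as f:
            tokens = tokenize(f.read().lower())
        # Generator expression: count matches without building a Text
        # object or a full frequency table.
        count = sum(1 for token in tokens if token == target)
        yield date[4:6], date[6:8], date[:4], count
```

Calling list(search_counts(files, "apple")) gives the list of tuples from the question. Because only one word is counted per call, a frequency table such as FreqDist is still the better fit when many different words will be looked up.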
