python - Extracting the most frequently used words from a corpus with Python

Question

This may be a silly question, but I'm having trouble extracting the ten most frequent words from a corpus using Python. This is what I have so far. (By the way, I use NLTK to read a corpus with two subcategories of 10 .txt files each.)

import re
import string
from nltk.corpus import stopwords
stoplist = stopwords.words('dutch')

from collections import defaultdict
from operator import itemgetter

def toptenwords(mycorpus):
    words = mycorpus.words()
    no_capitals = set([word.lower() for word in words]) 
    filtered = [word for word in no_capitals if word not in stoplist]
    no_punct = [s.translate(None, string.punctuation) for s in filtered] 
    wordcounter = {}
    for word in no_punct:
        if word in wordcounter:
            wordcounter[word] += 1
        else:
            wordcounter[word] = 1
    sorting = sorted(wordcounter.iteritems(), key = itemgetter, reverse = True)
    return sorting

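When I print the result of this function for my corpus, I get a list of all the words, each followed by a "1". It gives me a dictionary, but all my values are one. I know, for example, that the word "baby" occurs five or six times in my corpus, yet it still returns "baby: 1", so it doesn't work the way I want.
Can anyone help me?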

score 3 · Accepted Answer

The problem lies in your use of set.

A set contains no duplicates, so once you build a set of the lowercased words, every word occurs only once from then on.

Let's say your words are:

 ['banana', 'Banana', 'tomato', 'tomato','kiwi']

Lowercasing every word with your list comprehension gives:

 ['banana', 'banana', 'tomato', 'tomato','kiwi']

But then you do:

 set(['banana', 'banana', 'tomato', 'tomato', 'kiwi'])

which returns a set with one entry per distinct word (element order is arbitrary):

 set(['banana', 'tomato', 'kiwi'])
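To make the effect on the counts concrete, here is a minimal sketch that counts the same example words once from the list and once from the set. It is written for Python 3 and uses collections.Counter, which the question does not use; the variable names are only illustrative.

```python
from collections import Counter

words = ['banana', 'Banana', 'tomato', 'tomato', 'kiwi']
lowered = [w.lower() for w in words]

# Counting the list keeps every occurrence.
counts_from_list = Counter(lowered)

# Counting the set sees each distinct word exactly once.
counts_from_set = Counter(set(lowered))

print(counts_from_list['banana'], counts_from_set['banana'])
```

Every count taken from the set is 1, which matches the output described in the question.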

From that moment on, you do your calculations on the set no_capitals, so each word occurs only once. Don't create a set, and your program will probably work fine.
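One possible shape for a corrected function is sketched below. It is not the asker's code: it assumes Python 3, swaps the hand-written dictionary for collections.Counter, and takes the word list and stop list as parameters so it runs without NLTK. The names top_words, stoplist and n are hypothetical. Note also that the question passes `key = itemgetter` rather than `itemgetter(1)`, so the sort there would not order by count even after removing the set; Counter.most_common sidesteps that.

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def top_words(words, stoplist=(), n=10):
    """Return the n most common words as (word, count) pairs.

    `words` is any iterable of tokens (for example mycorpus.words());
    `stoplist` is an iterable of lowercase words to ignore.
    """
    stop = set(stoplist)  # a set is fine here: it is only used for lookups
    counts = Counter()
    for word in words:
        lowered = word.lower()
        if lowered in stop:
            continue
        token = lowered.translate(_STRIP_PUNCT)
        if token:  # skip tokens that were pure punctuation
            counts[token] += 1
    return counts.most_common(n)


sample = ["Baby", "baby", "de", "baby", ".", "tomato", "Tomato", "kiwi"]
result = top_words(sample, stoplist=["de"], n=2)
print(result)
```

With NLTK available, the call would presumably look like top_words(mycorpus.words(), stopwords.words('dutch')). The word list is kept as a list throughout, so repeated words are still there when they are counted.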
