python - Python NLTK: Count a list of words and compute a probability over valid English words

Question

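I have a dirty document that contains invalid English words, numbers, and so on. I just want to extract all the valid English words and then compute the ratio of the words in my word list to the total number of valid English words.

For example, suppose my document contains the following sentence: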

sentence= ['eishgkej he might be a good person. I might consider this.']

I only want to count the words in "he might be a good person. I might consider this", and within those, count the occurrences of "might".

So the answer I am after would be 2/10.
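To make the 2/10 arithmetic concrete, here is a minimal sketch using only the standard library. The names `VALID_WORDS` and `target_ratio` are hypothetical, and the tiny hand-written vocabulary is an assumption standing in for a real English word list (for example, one loaded from an NLTK corpus).

```python
import re
from collections import Counter

# Hypothetical stand-in for a real English vocabulary; in practice this
# would come from a proper word list rather than being typed by hand.
VALID_WORDS = {"he", "might", "be", "a", "good", "person", "i", "consider", "this"}


def target_ratio(text, targets, vocabulary):
    """Return (occurrences of target words) / (number of valid tokens)."""
    # Lowercase and keep alphabetic runs only, which drops digits and punctuation.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Keep only tokens found in the vocabulary ("eishgkej" is dropped here).
    valid = [t for t in tokens if t in vocabulary]
    if not valid:
        return 0.0
    counts = Counter(valid)
    hits = sum(counts[t] for t in targets)
    return hits / len(valid)


sentence = "eishgkej he might be a good person. I might consider this."
ratio = target_ratio(sentence, {"might"}, VALID_WORDS)
print(ratio)  # 0.2, i.e. 2 occurrences of "might" among 10 valid words
```

The denominator is the number of valid tokens, not the number of distinct valid words, which is what makes the result 2/10 for this sentence.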

I am thinking of using the code below. However, I need to change the line features[word] = 1 so that it stores the count of the feature instead of 1...

 all_words = nltk.FreqDist(w.lower() for w in reader.words() if w.lower() not in english_sw)

 def document_features(document):
     document_words = set(document)
     features = {}
     for word in word_features:
         if word in document_words:
             features[word] = 1
         else:
             features[word]=0
     return features

score 1 · Accepted Answer

According to the documentation, you can use count(self, sample) to return the number of times a word occurs in a FreqDist object. So I think you want something like this:

 for word in word_features:
     if word in document_words:
         features[word] = all_words.count(word)
     else:
         features[word]= 0

Alternatively, you can use indexing: all_words[word] should return the same thing as all_words.count(word).

If you want the relative frequency of a word, you can use all_words.freq(word).
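The difference between a raw count and a relative frequency can be sketched as follows. This uses `collections.Counter` as a stand-in so the example runs without NLTK installed; that substitution is an assumption, based on FreqDist in NLTK 3 being a Counter subclass, so the indexing shown here should carry over to `all_words[word]`. The helpers `relative_freq` and `document_features` are hypothetical names, and `relative_freq` only approximates what `all_words.freq(word)` is described as returning.

```python
from collections import Counter

# Stand-in for the FreqDist built in the question (assumption: a Counter
# behaves the same way for indexing as FreqDist does).
words = ["he", "might", "be", "a", "good", "person",
         "i", "might", "consider", "this"]
all_words = Counter(words)


def relative_freq(dist, word):
    """Count of `word` divided by the total number of counted tokens."""
    total = sum(dist.values())
    return dist[word] / total if total else 0.0


def document_features(document, word_features, dist):
    """Map each feature word to its count in `dist`, or 0 if absent from the document."""
    document_words = set(document)
    return {
        word: dist[word] if word in document_words else 0
        for word in word_features
    }


features = document_features(words, ["might", "good", "banana"], all_words)
print(features)                           # {'might': 2, 'good': 1, 'banana': 0}
print(relative_freq(all_words, "might"))  # 0.2
```

Indexing a missing word returns 0 rather than raising, so the else branch matters only when the counts come from a larger collection than the document being featurized.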
