python - Splitting the words of an nltk.FreqDist into two lists?

Question

I have a series of texts that are instances of a custom WebText class. Each text is an object with an associated rating (-10 to +10) and a word count (nltk.FreqDist).

>>> trainingTexts = [WebText('train1.txt'), WebText('train2.txt'), WebText('train3.txt'), WebText('train4.txt')]
>>> trainingTexts[1].rating
10
>>> trainingTexts[1].freq_dist
<FreqDist: 'the': 60, ',': 49, 'to': 38, 'is': 34,...>

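How can I get two lists (or dictionaries): one containing every word used exclusively in the positively rated texts (trainingTexts[].rating > 0), and another containing every word used exclusively in the negative texts (trainingTexts[].rating < 0)? Each list should also carry the total word counts across all positive or all negative texts, so that you get something like this: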

>>> only_positive_words
[('sky', 10), ('good', 9), ('great', 2)...]
>>> only_negative_words
[('earth', 10), ('ski', 9), ('food', 2)...]

I considered using sets, since a set contains only unique items, but I can't see how to do this with nltk.FreqDist. On top of that, a set would not be ordered by word frequency. Any ideas?

score 2 · Accepted Answer

Okay, let's say you start with this for testing purposes:

import nltk

# minimal stand-in for the WebText class: just a rating and a FreqDist
class Rated(object):
  def __init__(self, rating, freq_dist):
    self.rating = rating
    self.freq_dist = freq_dist

a = Rated(5, nltk.FreqDist('the boy sees the dog'.split()))
b = Rated(8, nltk.FreqDist('the cat sees the mouse'.split()))
c = Rated(-3, nltk.FreqDist('some boy likes nothing'.split()))

trainingTexts = [a,b,c]

Then the code would look like this:

from collections import defaultdict
from operator import itemgetter

# dictionaries for keeping track of the counts
pos_dict = defaultdict(int)
neg_dict = defaultdict(int)

for r in trainingTexts:
  rating = r.rating
  freq = r.freq_dist

  # choose the appropriate counts dict
  if rating > 0:
    partition = pos_dict
  elif rating < 0: 
    partition = neg_dict
  else:
    continue

  # add the information to the correct counts dict
  for word, count in freq.items():
    partition[word] += count

# Turn the counts dictionaries into lists of descending-frequency words
def only_list(counts, filtered):
  # keep (word, count) pairs whose word never appears in `filtered`
  kept = [(w, c) for w, c in counts.items() if w not in filtered]
  return sorted(kept, key=itemgetter(1), reverse=True)

only_positive_words = only_list(pos_dict, neg_dict)
only_negative_words = only_list(neg_dict, pos_dict)

And the result is as follows (the order of words with equal counts may vary with the Python version):

>>> only_positive_words
[('the', 4), ('sees', 2), ('dog', 1), ('cat', 1), ('mouse', 1)]
>>> only_negative_words
[('nothing', 1), ('some', 1), ('likes', 1)]
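
Since the question mentions sets, here is a minimal alternative sketch along those lines. It assumes the frequency distributions behave like collections.Counter (in NLTK 3, FreqDist is a Counter subclass), so plain Counter objects stand in for them here and the snippet runs without NLTK. The names rated_texts and exclusive are made up for this illustration and are not part of NLTK or of the code above.

```python
from collections import Counter

# Hypothetical stand-ins for the rated texts: (rating, word counts) pairs.
# Counter is used in place of nltk.FreqDist (an assumption of this sketch).
rated_texts = [
    (5, Counter('the boy sees the dog'.split())),
    (8, Counter('the cat sees the mouse'.split())),
    (-3, Counter('some boy likes nothing'.split())),
]

# Accumulate total counts per polarity; Counter.update() adds counts.
pos_counts = Counter()
neg_counts = Counter()
for rating, freq in rated_texts:
    if rating > 0:
        pos_counts.update(freq)
    elif rating < 0:
        neg_counts.update(freq)

def exclusive(counts, other):
    """Return (word, count) pairs for words in `counts` but not in `other`,
    most frequent first."""
    # Key views support set difference, which gives the exclusive words.
    exclusive_words = counts.keys() - other.keys()
    kept = Counter({w: counts[w] for w in exclusive_words})
    return kept.most_common()

only_positive_words = exclusive(pos_counts, neg_counts)
only_negative_words = exclusive(neg_counts, pos_counts)
```

The set difference picks out the exclusive words, and most_common() restores the frequency ordering that a bare set would lose. Words with equal counts come out in no guaranteed order.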
