python - Getting a cumulative count of word frequencies found across documents

Question

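I am trying to detect trends in the words/bigrams of a set of texts. What I have done so far is remove the stop words, lowercase the words and get their frequencies, and append the top 30 for each text to a list.

For example: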

[(u'seeing', 2), (u'said.', 2), (u'one', 2), (u'death', 2), (u'entertainment',   2), (u'it\u2019s', 2), (u'weiss', 2), (u'read', 2), (u'\u201cit', 1), (u'shot', 1), (u'show\u2019s', 1), (u'people', 1), (u'dead,\u201d', 1), (u'bloody', 1),...]

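I then converted the lists above into one giant list containing every word and its per-document frequency. What I need to do next is get back a sorted list with the cumulative counts, i.e.: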

[(u'snow', 32), (u'said.', 12), (u'GoT', 10), (u'death', 8), (u'entertainment', 4)..]

Any ideas?

Code:

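from nltk import FreqDist  # assuming FreqDist here is NLTK's

# texts: iterable of document strings; stopwords: collection of lowercase words
fdists = []
for i in texts:
    words = FreqDist(w.lower() for w in i.split() if w.lower() not in stopwords)
    fdists.append(words.most_common(30))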

all_in_one = [item for sublist in fdists for item in sublist]

score 0 · Accepted Answer

If all you want to do is sum the duplicates and sort the list, you can use:

import operator

fdists = [(u'seeing', 2), (u'said.', 2), (u'one', 2), (u'death', 2), (u'entertainment',   2), (u'it\u2019s', 2), (u'weiss', 2), (u'read', 2), (u'\u201cit', 1), (u'shot', 1), (u'show\u2019s', 1), (u'people', 1), (u'dead,\u201d', 1), (u'bloody', 1)]
fdists2 = [(u'seeing', 3), (u'said.', 4), (u'one', 2), (u'death', 2), (u'entertainment',   2), (u'it\u2019s', 2), (u'weiss', 2), (u'read', 2)]
fdists += fdists2

fdict = {}
for i in fdists:
    if i[0] in fdict:
        fdict[i[0]] += i[1]
    else:
        fdict[i[0]] = i[1]

sorted_f = sorted(fdict.items(), key=operator.itemgetter(1), reverse=True)
print(sorted_f[:30])

[(u'said.', 6), (u'seeing', 5), (u'death', 4), (u'entertainment', 4), (u'read', 4), (u'it\u2019s', 4), (u'weiss', 4), (u'one', 4), (u'\u201cit', 1), (u'shot', 1), (u'show\u2019s', 1), (u'people', 1), (u'dead,\u201d', 1), (u'bloody', 1)]
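
The same accumulation can also be written with `collections.Counter` from the standard library. This is only a minimal sketch of that alternative, not part of the original answer; the input lists are shortened versions of the example data above.

```python
from collections import Counter

# Shortened versions of the per-document (word, count) lists above.
fdists = [(u'seeing', 2), (u'said.', 2), (u'one', 2), (u'shot', 1)]
fdists2 = [(u'seeing', 3), (u'said.', 4), (u'one', 2)]

totals = Counter()
for word, count in fdists + fdists2:
    totals[word] += count  # a missing key starts at 0, so no "in" check is needed

# most_common(n) returns (word, total) pairs sorted by descending total.
top = totals.most_common(30)
print(top)
```

`Counter` handles both the "first time seen" case and the final sort, so the explicit `if`/`else` and the `operator.itemgetter` key are no longer needed.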

Another way to handle the duplicates is to use the pandas groupby() function to sum the counts per word, and then sort the result by count and word:

import pandas as pd

fdists = [(u'seeing', 2), (u'said.', 2), (u'one', 2), (u'death', 2), (u'entertainment',   2), (u'it\u2019s', 2), (u'weiss', 2), (u'read', 2), (u'\u201cit', 1), (u'shot', 1), (u'show\u2019s', 1), (u'people', 1), (u'dead,\u201d', 1), (u'bloody', 1)]
fdists2 = [(u'seeing', 3), (u'said.', 4), (u'one', 2), (u'death', 2), (u'entertainment',   2), (u'it\u2019s', 2), (u'weiss', 2), (u'read', 2)]
fdists += fdists2

df = pd.DataFrame(data=fdists, columns=['word', 'count'])
# Sum the counts of duplicate words; the result keeps 'word' as a regular column.
df = df.groupby('word', as_index=False)['count'].sum()

# Sort by count (descending), then by word (ascending).
# sort_values() is the current replacement for the old DataFrame.sort().
Sorted = df.sort_values(['count', 'word'], ascending=[False, True])
print(Sorted[:30])

             word  count
8           said.      6
9          seeing      5
2           death      4
3   entertainment      4
4            it’s      4
5             one      4
7            read      4
12          weiss      4
0          bloody      1
1          dead,”      1
6          people      1
10           shot      1
11         show’s      1
13            “it      1
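
Since the question also mentions bigrams, here is a rough sketch of accumulating word and bigram counts across documents directly, without the intermediate per-document top-30 lists. It is an illustration under assumptions, not the asker's code: `count_terms`, the sample `texts`, and the sample `stopwords` are all hypothetical, and bigrams are formed from the word list after stop words have been removed.

```python
from collections import Counter

def count_terms(text, stopwords):
    """Count lowercased words and bigrams in one text (hypothetical helper)."""
    words = [w.lower() for w in text.split() if w.lower() not in stopwords]
    counts = Counter(words)
    # Bigrams are (first, second) tuples over the filtered word list.
    counts.update(zip(words, words[1:]))
    return counts

# Hypothetical sample data standing in for the real corpus.
texts = [u"Snow is dead said Snow", u"Snow said the show is bloody"]
stopwords = {u"is", u"the"}

totals = Counter()
for text in texts:
    # update() with a Counter adds the counts instead of overwriting them.
    totals.update(count_terms(text, stopwords))

top = totals.most_common(30)
print(top)
```

Summing the full per-document counts before truncating also keeps words that never reach any single document's top 30, which may or may not matter for the trend analysis.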
