python - Efficiently calculate word frequency in a string

Question

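I am parsing a long string of text and counting the number of times each word occurs, in Python. I have a function that works, but I am looking for advice on whether it can be made more efficient (in terms of speed), and whether there is a Python library function that could do this for me so I'm not reinventing the wheel.

Can you suggest a more efficient way to calculate the most common words occurring in a long string (usually over 1000 words in the string)?

Also, what is the best way to sort a dictionary into a list where the first element is the most common word, the second element is the second most common word, and so on?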

test = """abc def-ghi jkl abc
abc"""

def calculate_word_frequency(s):
    # Post: return a list of words ordered from the most
    # frequent to the least frequent

    words = s.split()
    freq  = {}
    for word in words:
        if word in freq:
            freq[word] += 1
        else:
            freq[word] = 1
    return sort(freq)

def sort(d):
    # Post: sort dictionary d into list of words ordered
    # from highest freq to lowest freq
    # eg: For {"the": 3, "a": 9, "abc": 2} should be
    # sorted into the following list ["a","the","abc"]

    #I have never used lambda's so I'm not sure this is correct
    return sorted(d, key=lambda word: d[word], reverse=True)

print(calculate_word_frequency(test))

score 40 · Accepted Answer

Use collections.Counter:

>>> from collections import Counter
>>> test = 'abc def abc def zzz zzz'
>>> Counter(test.split()).most_common()
[('abc', 2), ('zzz', 2), ('def', 2)]
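If you only want the words themselves in order, as the question asks, you can drop the counts from the pairs that most_common() returns. A minimal sketch, assuming plain whitespace splitting is good enough; the helper name words_by_frequency is made up for this example and is not part of collections:

```python
from collections import Counter

def words_by_frequency(s):
    # Hypothetical helper: split on whitespace, count each word,
    # and return only the words, most frequent first.
    counts = Counter(s.split())
    return [word for word, _ in counts.most_common()]

print(words_by_frequency("a the a abc a the"))  # ['a', 'the', 'abc']
```

This replaces both the counting loop and the separate sort function from the question.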

score 6 · Accepted Answer

>>> test = """abc def-ghi jkl abc
... abc"""
>>> from collections import Counter
>>> words = Counter()
>>> words.update(test.split()) # Update counter with words
>>> words.most_common()        # Print list with most common to least common
[('abc', 3), ('jkl', 1), ('def-ghi', 1)]
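Because update() can be called more than once, the same counter can be fed a long text piece by piece instead of splitting the whole string in one go. A rough sketch under that assumption; count_words_by_line is a hypothetical name used only here:

```python
from collections import Counter

def count_words_by_line(lines):
    # Hypothetical helper: add the words of each line to one shared counter.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

counts = count_words_by_line(["abc def-ghi jkl abc", "abc"])
print(counts["abc"])          # 3
print(counts.most_common(1))  # [('abc', 3)]
```

The argument can be any iterable of strings, so an open file object should work as well.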

score 3 · Accepted Answer

You can also use NLTK (Natural Language ToolKit). It provides a very nice library for studying text processing. For this example, you can use:

from nltk import FreqDist

text = "aa bb cc aa bb"
# split into words first; passing the raw string would count characters
fdist1 = FreqDist(text.split())

# show the 10 most frequent words in the text
print(fdist1.most_common(10))

The result will be:

[('aa', 2), ('bb', 2), ('cc', 1)]
