python - My frequency function is inefficient

Question

I am writing a function that returns the number of occurrences of the word that appears most often in a list of words.

def max_frequency(words):
    """Returns the number of times appeared of the word that
    appeared the most in a list of words."""

    words_set = set(words)
    words_list = words
    word_dict = {}

    for i in words_set:
        count = []
        for j in words_list:
            if i == j:
                count.append(1)
        word_dict[i] = len(count)

    result_num = 0
    for _, value in word_dict.items():
        if value > result_num:
            result_num = value
    return result_num

For example:

words = ["Happy", "Happy", "Happy", "Duck", "Duck"]
answer = max_frequency(words)
print(answer)

3

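However, this function becomes slow when the list contains a large number of words. For example, with a list of 250,000 words, it takes more than 4 minutes to print the output. So I am asking for help to tweak it.

I don't want to import anything.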

score 3 · Accepted Answer

To avoid passing over the list once for every unique word, simply iterate over it once and update the dictionary value for each word as you count it.

counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

Output:

>>> print(max(counts.values()))
3
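
Wrapped up as a drop-in replacement for the question's function, the loop above might look like the following sketch. The `default=0` argument (available in Python 3.4 and later) is an addition that is not in the original answer; it is there so that an empty list returns 0, as the question's function does, rather than raising `ValueError`.

```python
def max_frequency(words):
    """Return the count of the most frequent word in words (0 if empty)."""
    counts = {}
    for word in words:
        # Single pass: each word costs one dictionary lookup and one store.
        counts[word] = counts.get(word, 0) + 1
    # default=0 is an assumption added here so an empty list returns 0.
    return max(counts.values(), default=0)


words = ["Happy", "Happy", "Happy", "Duck", "Duck"]
print(max_frequency(words))  # prints 3
```

The question's version scans the whole list once per distinct word, so its cost grows with (number of distinct words) x (list length). This version touches each word once and needs no imports.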

That said, this can be done much more nicely by using a defaultdict instead of get, or by using collections.Counter. Restricting yourself from doing imports in Python is never a good idea when you have the choice.

For example, using collections.Counter:
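
The answer mentions defaultdict without showing it. A minimal sketch of that variant, assuming the same `words` list as in the question, could be:

```python
from collections import defaultdict

words = ["Happy", "Happy", "Happy", "Duck", "Duck"]

# int() returns 0, so a missing key starts at zero automatically
# and no .get(word, 0) call is needed.
counts = defaultdict(int)
for word in words:
    counts[word] += 1

print(max(counts.values()))  # prints 3
```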

from collections import Counter
counter = Counter(words)
most_common = counter.most_common(1)
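
Note that `most_common(1)` returns a list of `(word, count)` pairs rather than a bare number, so the count still has to be pulled out. The sketch below shows one way to do that; the `(None, 0)` fallback for an empty input is an assumption made here, not part of the answer.

```python
from collections import Counter

words = ["Happy", "Happy", "Happy", "Duck", "Duck"]
counter = Counter(words)

# most_common(1) returns a one-element list of (word, count) pairs,
# or an empty list if there were no words at all.
most_common = counter.most_common(1)

# The (None, 0) fallback is an assumption added for the empty case.
top_word, top_count = most_common[0] if most_common else (None, 0)

print(top_word, top_count)  # prints: Happy 3
```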

score 0 · Accepted Answer

A data size similar to the OP's

Let's start with a list of words

In [55]: print(words)
['oihwf', 'rpowthj', 'trhok', 'rtpokh', 'tqhpork', 'reaokp', 'eahopk', 'qeaopker', 'okp[qrg', 'okehtq', 'pinjjn', 'rq38na', 'aogopire', "apoe'ak", 'apfobo;444', 'jiaegro', '908qymar', 'pe9irmp4', 'p9itoijar', 'oijor8']

Build a text by combining these words at random

In [56]: from random import choice
In [57]: text = ' '.join(choice(words) for _ in range(250000))

Different methods are possible

From the text we can get a list of the words in the text (note: wl is very different from words...).

In [58]: wl = text.split()

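From this list we want to extract a dictionary, or a dictionary-like object, that gives the number of occurrences of each word. There are many options.

In the first option, we build a dictionary containing all the distinct words in wl, set the value of each key to zero, and then loop over the list of words a second time to count the occurrences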

In [59]: def count0(wl):
    wd = dict(zip(wl,[0]*len(wl)))
    for w in wl: wd[w] += 1            
    return wd
   ....:
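
As a side note, the zero-initialised dictionary in the first option can also be built with `dict.fromkeys`, which avoids creating the temporary `[0]*len(wl)` list. The function name `count0_fromkeys` is hypothetical, and this variant was not part of the timings in this answer, so no speed claim is made for it.

```python
def count0_fromkeys(wl):
    # dict.fromkeys builds {word: 0} for every distinct word in wl
    # without materialising a separate list of zeros.
    wd = dict.fromkeys(wl, 0)
    # Second pass over the list: every key already exists, so += is safe.
    for w in wl:
        wd[w] += 1
    return wd


print(count0_fromkeys(["a", "b", "a"]))  # prints {'a': 2, 'b': 1}
```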

In the second option, we start from an empty dictionary and use the get() method, which allows a default value.

In [60]: def count1(wl):
    wd = dict()                   
    for w in wl: wd[w] = wd.get(w, 0)+1
    return wd
   ....:

In the third and last option, we use a component of the standard library

In [61]: def count2(wl):
    from collections import Counter
    wc = Counter(wl)
    return wc
   ....:

Is one method better than the others?

Which one is best? Whichever you like most... anyway, here are the timings for each

In [62]: %timeit count0(wl) # start with a dict with 0 values
10 loops, best of 3: 82 ms per loop

In [63]: %timeit count1(wl) # uses .get(key, 0)
10 loops, best of 3: 92 ms per loop

In [64]: %timeit count2(wl) # uses collections.Counter
10 loops, best of 3: 43.8 ms per loop

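As expected, the fastest procedure is the one that uses collections.Counter, but I was a little surprised to notice that the first option, which makes TWO passes over the data, is faster than the second option... My guess (and it is only a guess) is that all the tests for a new value are done during dictionary instantiation, probably in some C code.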
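
The comparison can also be repeated outside IPython with the standard `timeit` module. The script below is a sketch: the shortened `vocabulary` list and the fixed seed are assumptions added here for repeatability, and the absolute numbers will depend on the machine and the Python version, so they may not match the figures above.

```python
import timeit
from collections import Counter
from random import choice, seed


def count0(wl):
    # Two passes: build a zero-filled dict, then count.
    wd = dict(zip(wl, [0] * len(wl)))
    for w in wl:
        wd[w] += 1
    return wd


def count1(wl):
    # One pass, using .get() with a default of 0.
    wd = {}
    for w in wl:
        wd[w] = wd.get(w, 0) + 1
    return wd


def count2(wl):
    # Standard-library counter.
    return Counter(wl)


seed(0)  # fixed seed (an assumption added here) so runs are repeatable
vocabulary = ["oihwf", "rpowthj", "trhok", "rtpokh", "tqhpork"]  # shortened stand-in list
wl = [choice(vocabulary) for _ in range(250000)]

# All three approaches should agree on the counts before being timed.
assert count0(wl) == count1(wl) == dict(count2(wl))

for func in (count0, count1, count2):
    # Best of 3 repeats, 10 calls each, reported per call in milliseconds.
    best = min(timeit.repeat(lambda: func(wl), number=10, repeat=3)) / 10
    print(f"{func.__name__}: {best * 1000:.1f} ms per call")
```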

score 0 · Accepted Answer

You can try this code, which is about 760% faster.

def max_frequency(words):
    """Returns the number of times appeared of the word that
    appeared the most in a list of words."""

    count_dict = {}
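    max_count = 0  # named max_count to avoid shadowing the built-in max()

    for word in words:
        # Update this word's count and remember the new value.
        if word in count_dict:
            current_count = count_dict[word] = count_dict[word] + 1
        else:
            current_count = count_dict[word] = 1

        # Track the running maximum so no second pass is needed.
        if current_count > max_count:
            max_count = current_count

    return max_count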
