python - 値に基づいて辞書の上位kキーを効率的に追跡する

Question

辞書のキーが更新されている間に、値が最大の辞書の上位kキーを効率的に追跡するにはどうすればよいですか？

更新のたびに辞書からソートされたリストを作成するという素朴なアプローチを試しましたが（辞書で最大値のキーを取得するで説明されていますか？）、それは非常に高価であり、拡張性がありません。

実例：

データの無限のストリームから来る単語の頻度を数えます。いつでも、プログラムは、単語が現在の上位kの最も頻繁な値にあるかどうかを報告するように求められる場合があります。これを効率的に達成するにはどうすればよいですか？

collections.Counterが遅すぎます

>>> from itertools import permutations
>>> from collections import Counter
>>> from timeit import timeit
>>> c = Counter()
>>> for x in permutations(xrange(10), 10):
    c[x] += 1


>>> timeit('c.most_common(1)', 'from __main__ import c', number=1)
0.7442058258093311
>>> sum(c.values())
3628800

この値を計算するのに1秒近くかかります！

関数のO（1）時間を探してい most_common()ます。これは、現在の上位k項目のみを内部的に格納し、現在の最小値を追跡する別のデータ構造を持つことで実行できるはずです。

score 2 · Accepted Answer

collections.Counter.most_common すべての値をパスし、N 番目に大きい値を見つけてヒープに入れます(O(M log N) 時間、M は辞書要素の総数です)。

heapq、コメントで Wei Yen が示唆しているように、問題なく動作する可能性があります。辞書と並行してheapq、N 個の最大値を維持し、辞書を変更するときに、値がそこにあるかどうか、または現在そこにある必要があるかどうかを確認します。問題は、あなたが指摘したように、インターフェイスには、すでに-既存の要素。

関連するアイテムをその場で変更してから実行heapq.heapifyして、ヒープネスを復元できます。これには、関連するアイテムを見つけるためにヒープ (N) のサイズの線形パスが必要です (要素を位置に関連付けるために追加の簿記を行っている場合を除きます。おそらく価値はありません)。リストに含まれていなかった要素が現在含まれている場合は、最小の要素を置き換えてヒープに追加する必要があります (線形時間で、追加の構造を除いて)。

ただし、heapq プライベートインターフェイスには、_siftdown次のコメントを持つ関数が含まれています。

# 'heap' is a heap at all indices >= startpos, except possibly for pos.  pos
# is the index of a leaf with a possibly out-of-order value.  Restore the
# heap invariant.

いいですね！呼び出すheapq._siftdown(heap, 0, pos_of_relevant_idx)と、ログ N 時間でヒープが修正されます。もちろん、最初にインクリメントするインデックスの位置を見つける必要がありますが、これには線形時間がかかります。それを回避するために、要素の辞書をインデックスに保持する可能性があります（最小要素の位置へのポインターも保持します）が、ソースをコピーして_siftdown変更し、スワップ時に辞書を更新する必要がありますまたは、後で辞書を再構築するために線形タイムパスを実行します（ただし、線形パスを回避しようとしていただけです...）。

注意してください、これは O(log N) 時間でうまくいくはずです。ただし、フィボナッチヒープと呼ばれるものがあり、必要なすべての操作を (償却された)一定時間でサポートします。残念ながら、これはビッグオーがすべてではないケースの 1 つです。フィボナッチヒープの複雑さは、実際には、おそらく非常に大きなヒープを除いて、実際にはバイナリヒープよりも高速ではないことを意味します。また (おそらく「そのため」)、Boost C++ ライブラリには標準の Python 実装が含まれていますが、簡単なグーグル検索で見つけた標準の Python 実装はありません。

最初にを使用しheapqて、変更している要素の線形検索を行い、_siftdown;を呼び出してみます。Counterアプローチの O(M log N) と比較して、これは O(N) 時間です。それが遅すぎることが判明した場合は、インデックスの追加の辞書を維持し、独自のバージョンの辞書を作成_siftdownして dict を更新することができます。これには O(log N) 時間かかるはずです。それでも遅すぎる場合(私には疑わしい)、Boost の Fibonacci ヒープ (または別の実装) への Python ラッパーを探すことができますが、これが手間をかける価値があるとは思えません。

score 1 · Accepted Answer

collections.Counterその実世界の例では、すでにそれを使用しています。他のユースケースはありますか？

score 0 · Accepted Answer

標準ライブラリにこれが組み込まれているとは思わないので、上位 k 値を追跡するクラスを実装できます。これは、メインのディクショナリオブジェクト (おそらく)と並行して最新の状態に保たれますCounter。これをメイン辞書オブジェクトのサブクラスの属性として使用することもできます。

実装

class MostCommon(object):
    """Keep track the top-k key-value pairs.

    Attributes:
        top: Integer representing the top-k items to keep track of.
        store: Dictionary of the top-k items.
        min: The current minimum of any top-k item.
        min_set: Set where keys are counts, and values are the set of
            keys with that count.
    """
    def __init__(self, top):
        """Create a new MostCommon object to track key-value paris.

        Args:
            top: Integer representing the top-k values to keep track of.
        """
        self.top = top
        self.store = dict()
        self.min = None
        self.min_set = defaultdict(set)

    def _update_existing(self, key, value):
        """Update an item that is already one of the top-k values."""
        # Currently handle values that are non-decreasing.
        assert value > self.store[key]
        self.min_set[self.store[key]].remove(key)
        if self.store[key] == self.min:  # Previously was the minimum.
            if not self.min_set[self.store[key]]:  # No more minimums.
                del self.min_set[self.store[key]]
                self.min_set[value].add(key)
                self.min = min(self.min_set.keys())
        self.min_set[value].add(key)
        self.store[key] = value

    def __contains__(self, key):
        """Boolean if the key is one of the top-k items."""
        return key in self.store

    def __setitem__(self, key, value):
        """Assign a value to a key.

        The item won't be stored if it is less than the minimum (and
        the store is already full). If the item is already in the store,
        the value will be updated along with the `min` if necessary.
        """
        # Store it if we aren't full yet.
        if len(self.store) < self.top:
            if key in self.store:  # We already have this item.
                self._update_existing(key, value)
            else:  # Brand new item.
                self.store[key] = value
                self.min_set[value].add(key)
                if value < self.min or self.min is None:
                    self.min = value
        else:  # We're full. The value must be greater minimum to be added.
            if value > self.min:  # New item must be larger than current min.
                if key in self.store:  # We already have this item.
                    self._update_existing(key, value)
                else:  # Brand new item.
                    # Make room by removing one of the current minimums.
                    old = self.min_set[self.min].pop()
                    del self.store[old]
                    # Delete the set if there are no old minimums left.
                    if not self.min_set[self.min]:
                        del self.min_set[self.min]
                    # Add the new item.
                    self.min_set[value].add(key)
                    self.store[key] = value
                    self.min = min(self.min_set.keys())

    def __repr__(self):
        if len(self.store) < 10:
            store = repr(self.store)
        else:
            length = len(self.store)
            largest = max(self.store.itervalues())
            store = '<len={length}, max={largest}>'.format(length=length,
                                                           largest=largest)
        return ('{self.__class__.__name__}(top={self.top}, min={self.min}, '
                'store={store})'.format(self=self, store=store))

使用例

>>> common = MostCommon(2)
>>> common
MostCommon(top=2, min=None, store={})
>>> common['a'] = 1
>>> common
MostCommon(top=2, min=1, store={'a': 1})
>>> 'a' in common
True
>>> common['b'] = 2
>>> common
MostCommon(top=2, min=1, store={'a': 1, 'b': 2})
>>> common['c'] = 3
>>> common
MostCommon(top=2, min=2, store={'c': 3, 'b': 2})
>>> 'a' in common
False
>>> common['b'] = 4
>>> common
MostCommon(top=2, min=3, store={'c': 3, 'b': 4})

値を更新した後のアクセスは確かにO(1)です

>>> counter = Counter()
>>> for x in permutations(xrange(10), 10):
        counter[x] += 1

>>> common = MostCommon(1)
>>> for key, value in counter.iteritems():
    common[key] = value

>>> common
MostCommon(top=1, min=1, store={(9, 7, 8, 0, 2, 6, 5, 4, 3, 1): 1})
>>> timeit('repr(common)', 'from __main__ import common', number=1)
1.3251570635475218e-05

アクセスは O(1) ですが、O(n) 操作である set-item 呼び出し中に最小値が変更された場合、nは上位値の数です。Counterこれは、アクセスごとに O(n) であるよりも優れています。ここnで、は語彙全体のサイズです!

python - 値に基づいて辞書の上位kキーを効率的に追跡する

実例：

collections.Counterが遅すぎます

3 に答える 3

実装

使用例

値を更新した後のアクセスは確かにO(1)です

Related

Reference