computational-geometry - 高薄暗い空間に動的に挿入されるkNN

Question

私は、高次元の点（通常は〜11-13次元）に対して高速の最近傍（できればO（log n））を実行する方法を探しています。構造を初期化した後、挿入時に最適に動作するようにしたいと思います。KDツリーが頭に浮かびましたが、一括読み込みを行わずに動的挿入を行うと、kdツリーのバランスがとれなくなり、afaikのバランス調整はコストのかかる操作になります。

ですから、そのような設定にどのようなデータ構造を好むのか知りたいと思いました。高次元のポイントがあり、挿入を実行して最も近い隣人を照会したいとします。

score 5 · Accepted Answer

The Curse of Dimensionality gets in the way here. You might consider applying Principal Component Analysis (PCA) to reduce the dimensionality, but as far as I know, nobody has a great answer for this.

I have dealt with this type of problem before (in audio and video fingerprinting), sometimes with up to 30 dimensions. Analysis usually revealed that some of the dimensions did not contain relevant information for searches (actually fuzzy searches, my main goal), so I omitted them from the index structures used to access the data, but included them in the logic to determine matches from a list of candidates found during the search. This effectively reduced the dimensionality to a tractable level.

I simplified things further by quantizing the remaining dimensions severely, such that the entire multidimensional space was mapped into a 32-bit integer. I used this as the key in an STL map (a red-black tree), though I could have used a hash table. I was able to add millions of records dynamically to such a structure (RAM-based, of course) in about a minute or two, and searches took about a millisecond on average, though the data was by no means evenly distributed. Searches required careful enumeration of values in the dimensions that were mapped into the 32-bit key, but were reliable enough to use in a commercial product. I believe it is used to this day in iTunes Match, if my sources are correct. :)

The bottom line is that I recommend you take a look at your data and do something custom that exploits features in it to make for fast indexing and searching. Find the dimensions that vary the most and are the most independent of each other. Quantize those and use them as the key in an index. Each bucket in the index contains all items that share that key (there will likely be more than one). To find nearest neighbors, look at "nearby" keys and within each bucket, look for nearby values. Good luck.

p.s. I wrote a paper on my technique, available here. Sorry about the paywall. Perhaps you can find a free copy elsewhere. Let me know if you have any questions about it.

score 5 · Accepted Answer

頭に浮かぶもう 1 つのデータ構造は、カバーツリーです。もともと範囲クエリに答えるために開発された KD ツリーとは異なり、このデータ構造は最近傍クエリに最適です。これは、すべてのデータポイントの k 個の最近傍を計算する n 体問題で使用されています。このような問題は、密度推定スキーム (パルゼンウィンドウ) でも発生します。あなたの具体的な問題についてはよくわかりませんが、このデータ構造のオンラインバージョンがあることは知っています。アレクサンダー・グレイのページとこのリンクをチェックしてください

score 3 · Accepted Answer

バケットサイズが適度に大きいバケットKdツリーを使用すると、葉がいっぱいになったときにツリーがどこで分割するかをより正確に把握できます。Robocodeのメンバーは、非常に厳しい時間制約の下でこれを実行します。ランダムな挿入がオンザフライで行われ、kNNは1ms未満でk> 80、d> 10、n>30kです。このkD-Treeチュートリアルを確認してください。このチュートリアルでは、一連のkD-Treeの機能強化とその実装方法について説明しています。

score 1 · Accepted Answer

私の経験では、11 ～ 13 のディメンションはそれほど悪くはありません (一括読み込みの場合)。バルクロードされた R ツリー (kd ツリーとは対照的に、これらはバランスが保たれます!) と kd ツリーの両方が、線形スキャンよりもはるかにうまく機能するはずです。

完全に動的になると、私の経験はさらに悪化します。大まかに言うと、一括ロードされたツリーでは 20 倍のスピードアップが見られ、インクリメンタルに構築された R ツリーではわずか 7 倍です。したがって、ツリーを頻繁に再構築することは効果があります。また、データの整理方法によっては、思ったよりもはるかに高速になる場合があります。私が使用している kd-tree の一括読み込みはで、バリアントO(n log n)もあると読みました。O(n log log n)定数係数が低い。R ツリーの場合、Sort-Tile-Recursive はこれまでに見た中で最高のバルクロードでありO(n log n)、定数係数も低くなっています。

そうです、高次元では、時々ツリーをリロードすることを検討します。

computational-geometry - 高薄暗い空間に動的に挿入されるkNN

4 に答える 4

Related

Reference