string - Best way to match 4 million rows of data against each other and sort results by similarity?

Question

We use libpuzzle ( http://www.pureftpd.org/project/libpuzzle/doc ) to compare 4 million images against each other for similarity.

It works quite well.

But rather then doing a image vs image compare using the libpuzzle functions, there is another method of comparing the images.

Here is some quick background:

Libpuzzle creates a rather small (544 bytes) hash of any given image. This hash can in turn be used to compare against other hashes using libpuzzles functions. There are a few APIs... PHP, C, etc etc... We are using the PHP API.

The other method of comparing the images is by creating vectors from the given hash, here is a paste from the docs:

Cut the vector in fixed-length words. For instance, let's consider the following vector:

[ a b c d e f g h i j k l m n o p q r s t u v w x y z ]

With a word length (K) of 10, you can get the following words:

[ a b c d e f g h i j ] found at position 0 [ b c d e f g h i j k ] found at position 1 [ c d e f g h i j k l ] found at position 2 etc. until position N-1

Then, index your vector with a compound index of (word + position).

Even with millions of images, K = 10 and N = 100 should be enough to have very little entries sharing the same index.

So, we have the vector method working. Its actually works a bit better then the image vs image compare since when we do the image vs image compare, we use other data to reduce our sample size. Its a bit irrelevant and application specific what other data we use to reduce the sample size, but with the vector method... we would not have to do so, we could do a real test of each of the 4 million hashes against each other.

The issue we have is as follows:

With 4 million images, 100 vectors per image, this becomes 400 million rows. We have found MySQL tends to choke after about 60000 images (60000 x 100 = 6 million rows).

The query we use is as follows:

SELECT isw.itemid, COUNT(isw.word) as strength
FROM vectors isw
JOIN vectors isw_search ON isw.word = isw_search.word
WHERE isw_search.itemid = {ITEM ID TO COMPARE AGAINST ALL OTHER ENTRIES}
GROUP BY isw.itemid;

As mentioned, even with proper indexes, the above is quite slow when it comes to 400 million rows.

So, can anyone suggest any other technologies / algos to test these for similarity?

We are willing to give anything a shot.

Some things worth mentioning:

Hashes are binary.
Hashes are always the same length, 544 bytes.

The best we have been able to come up with is:

Convert image hash from binary to ascii.
Create vectors.
Create a string as follows: VECTOR1 VECTOR2 VECTOR3 etc etc.
Search using sphinx.

We have not yet tried the above, but this should probably yield a bit better results than doing the mysql query.

Any ideas? As mentioned, we are willing to install any new service (postgresql? hadoop?).

Final note, an outline of exactly how this vector + compare method works can be found in question Libpuzzle Indexing millions of pictures?. We are in essence using the exact method provided by Jason (currently the last answer, awarded 200+ so points).

score 0 · Accepted Answer

データベースでこれを行わないでください。単純なファイルを使用してください。[abcdefghijklmnopqrst]以下に、2 つのベクトル(画像 1) と[xxcdefghijklxxxxxxxx](画像 2)からの単語の一部を含むファイルを示します。

 <index>       <image>
0abcdefghij      1
1bcdefghijk      1
2cdefghijkl      1
3defghijklm      1
4efghijklmn      1
...
...
0xxcdefghij      2
1xcdefghijk      2
2cdefghijkl      2
3defghijklx      2
4efghijklxx      2
...

ファイルを並べ替えます。

  <index>       <image>
0abcdefghij      1
0xxcdefghij      2
1bcdefghijk      1
1xcdefghijk      2
2cdefghijkl      1       
2cdefghijkl      2       <= the index is repeated, those we have a match
3defghijklm      1
3defghijklx      2
4efghijklmn      1
4efghijklxx      2

ファイルがソートされると、同じインデックスを持つレコードを簡単に見つけることができます。ソートされたリストを実行して重複を見つけることができる小さなプログラムまたは何かを作成します。

score 0 · Accepted Answer

非常にうまく機能する解決策を見つけたので、「自分自身の質問に答える」ことを選択しました。

最初の質問で、スフィンクス検索でこれを行うことを考えていると言いました。

さて、私たちは先に進んでそれを実行しましたが、結果はmysqlを介してこれを実行するよりもはるかに優れています.

したがって、本質的にプロセスは次のようになります。

a) 画像からハッシュを生成します。

b) このハッシュを 100 の部分に「ベクトル化」します。

c) binhex (2 進数から 16 進数) は、これらのベクトルがバイナリ形式であるためです。

d) 次のようにスフィンクス検索に保存します。

アイテム ID | 0_vector0 1_vector1 2_vec... など

e) スフィンクス検索を使用して検索します。

最初は... この sphinxbase が 400 万件のレコードでいっぱいになると、検索ごとに約 1 秒かかります。

次に、8 コアでこの sphinxbase の分散インデックス作成を有効にしました。現在、1 秒あたり約 10 件以上の検索を実行しようとしています。これで十分です。

最後の 1 つのステップは、この sphinxbase を複数のサーバーにさらに分散させ、利用可能な未使用の CPU サイクルをさらに活用することです。

しかし、とりあえず、それで十分です。1 日あたり約 1000 ～ 2000 個の「アイテム」を追加するため、「新しいアイテムのみ」の検索は非常に迅速に行われます...最初のスキャンを行った後です。

string - Best way to match 4 million rows of data against each other and sort results by similarity?

2 に答える 2

Related

Reference