hadoop - Filtering duplicate keys

Question

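I am looking for a distributed solution to screen/filter a large volume of keys in real time. My application generates over 100 billion records per day, and I need a way to filter duplicates out of the stream. I am looking for a system that stores a rolling 10 days' worth of keys, at approximately 100 bytes per key. I was wondering how this type of large-scale problem has been solved before using Hadoop. Would HBase be the correct solution? Has anyone tried a partially in-memory solution like Zookeeper?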

score 4 · Accepted Answer

I can see a number of solutions to your problem, but the real-time requirement really narrows it down. By real-time, do you mean you want to know whether a key is a duplicate as it's being created?

Let's talk about queries per second. You say 100B/day (that's a lot, congratulations!). That's about 1.16 million queries per second (100,000,000,000 / 24 / 60 / 60). I'm not sure HBase can handle that. You may want to think about something like Redis (sharded, perhaps), Membase/memcached, or something of that sort.
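The arithmetic behind that rate can be checked in a few lines of Python. This is only a back-of-the-envelope sketch: the function name `queries_per_second` is made up for illustration, and it assumes the load is spread evenly over the day and that every record triggers exactly one lookup, which real traffic rarely does.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def queries_per_second(records_per_day: int) -> float:
    """Average lookup rate if every record triggers one duplicate check."""
    return records_per_day / SECONDS_PER_DAY


# 100 billion records per day, the figure given in the question.
rate = queries_per_second(100_000_000_000)
print(f"{rate:,.0f} lookups per second on average")
```

Peak rates will be higher than this average, so any store you pick needs headroom beyond it.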

If you were to do it in HBase, I'd simply push the upwards of a trillion keys (10 days x 100B keys) into the table as row keys, and put some placeholder value in there (because you have to store something). Then you can just do a get to figure out whether the key is in there. This is kind of hokey and doesn't fully utilize HBase, since it only uses the keyspace. Effectively, HBase is a b-tree service in this case. I don't think this is a good idea.

If you relax the constraint so that it doesn't have to be real-time, you could use MapReduce in batch to dedup. That's pretty easy: it's just Word Count without the counting. You group by the key, and you'll see the dups in the reducer if multiple values come back. With enough nodes and enough latency, you can solve this problem efficiently. Here is some example code for this from the MapReduce Design Patterns book: https://github.com/adamjshook/mapreducepatterns/blob/master/MRDP/src/main/java/mrdp/ch3/DistinctUserDriver.java
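To make the "Word Count without the counting" idea concrete, here is a minimal single-process sketch in plain Python. It is not the linked Java driver and does not use Hadoop at all; the function names `map_phase`, `shuffle`, and `reduce_phase` are illustrative stand-ins for what the framework does across many nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(records: Iterable[str]) -> Iterator[Tuple[str, None]]:
    """Emit each key with an empty value, like Word Count without the count."""
    for key in records:
        yield key, None


def shuffle(pairs: Iterable[Tuple[str, None]]) -> Dict[str, List[None]]:
    """Group values by key, as the framework would between map and reduce."""
    groups: Dict[str, List[None]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[None]]) -> Tuple[List[str], List[str]]:
    """Emit each key once; keys with more than one value were duplicated."""
    distinct, duplicated = [], []
    for key, values in groups.items():
        distinct.append(key)
        if len(values) > 1:
            duplicated.append(key)
    return distinct, duplicated


# Tiny made-up input: "k1" appears three times, "k2" twice, "k3" once.
records = ["k1", "k2", "k1", "k3", "k2", "k1"]
distinct, duplicated = reduce_phase(shuffle(map_phase(records)))
print(sorted(distinct))
print(sorted(duplicated))
```

In a real job the grouping step is the expensive part, because every key has to travel over the network to the reducer responsible for it. That is why this approach trades latency for simplicity.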

ZooKeeper is for distributed process communication and synchronization. You don't want to be storing trillions of records in ZooKeeper.

So, in my opinion, you're better served by an in-memory key/value store such as Redis, but you'll be hard pressed to store that much data in memory.
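As a rough illustration of what a sharded in-memory duplicate check looks like, here is a toy sketch in plain Python. It is an assumption-laden stand-in, not a Redis client: the class `ShardedDedupFilter` and its method names are hypothetical, each "shard" is just a local dict, and the 10-day window comes from the question.

```python
import hashlib
import time
from typing import Dict, List, Optional

RETENTION_SECONDS = 10 * 24 * 60 * 60  # rolling 10-day window from the question


class ShardedDedupFilter:
    """Toy stand-in for a sharded key/value store; each shard is a local dict."""

    def __init__(self, num_shards: int) -> None:
        self.shards: List[Dict[bytes, float]] = [dict() for _ in range(num_shards)]

    def _shard_for(self, key: bytes) -> Dict[bytes, float]:
        # Stable hash so the same key always lands on the same shard.
        digest = hashlib.sha1(key).digest()
        return self.shards[int.from_bytes(digest[:8], "big") % len(self.shards)]

    def seen_before(self, key: bytes, now: Optional[float] = None) -> bool:
        """Return True for a duplicate inside the window; otherwise record the key."""
        now = time.time() if now is None else now
        shard = self._shard_for(key)
        first_seen = shard.get(key)
        if first_seen is not None and now - first_seen < RETENTION_SECONDS:
            return True
        shard[key] = now  # new key, or the old entry has aged out of the window
        return False


dedup = ShardedDedupFilter(num_shards=4)
first = dedup.seen_before(b"record-1", now=0)                       # first sighting
second = dedup.seen_before(b"record-1", now=60)                     # one minute later
third = dedup.seen_before(b"record-1", now=RETENTION_SECONDS + 1)   # after the window
print(first, second, third)
```

With Redis itself, the check-and-record step could plausibly be a single `SET key value NX EX <seconds>` per key, which only writes when the key is absent and lets the server expire old entries. The sketch never purges aged-out keys and does nothing about the memory footprint, which remains the hard part.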

score 1 · Accepted Answer

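Unfortunately, this is not possible with traditional systems :|

This is what you mentioned:

100 billion per day means approximately 1 million per second!
The key size is 100 bytes.
You want to check for duplicates in a 10-day working set, which means 1 trillion items.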

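Under these assumptions, you are doing lookups in a set of 1 trillion objects with a total size of about 90 terabytes! To solve this in real time, you need a system that can look up 1 million items per second against that amount of data. I have experience with HBase, Cassandra, Redis, and Memcached. I am sure you cannot achieve this performance with disk-based storage such as HBase, Cassandra, or HyperTable (and you can add RDBMSs such as MySQL and PostgreSQL to that list). The best performance I have actually heard of for Redis and memcached is about 100,000 operations per second on a single machine. This means you would need 90 machines, each with 1 terabyte of RAM!
Even a batch processing system like Hadoop cannot do this job in under an hour; I would guess it takes hours to days, even on a large cluster of 100 machines.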
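The sizing above can be reproduced with a short calculation. This is a rough sketch under stated assumptions: it counts raw key bytes only (no per-entry overhead, replication, or index structures), reads "1 terabyte of RAM" as 1 TiB, and takes the per-machine throughput from the anecdotal figure in this answer. The variable names are made up for illustration.

```python
import math

KEYS_PER_DAY = 100_000_000_000  # from the question
RETENTION_DAYS = 10             # from the question
KEY_BYTES = 100                 # from the question
OPS_PER_MACHINE = 100_000       # anecdotal per-machine figure from this answer
RAM_PER_MACHINE_TIB = 1         # machine size assumed in this answer, read as TiB

total_keys = KEYS_PER_DAY * RETENTION_DAYS
total_tib = total_keys * KEY_BYTES / 2**40        # raw key bytes only
lookups_per_second = KEYS_PER_DAY / (24 * 60 * 60)

# Machines needed to hold the keys in RAM versus to serve the lookup rate.
machines_for_memory = math.ceil(total_tib / RAM_PER_MACHINE_TIB)
machines_for_throughput = math.ceil(lookups_per_second / OPS_PER_MACHINE)

print(f"{total_keys:,} keys, {total_tib:.1f} TiB of raw key data")
print(f"machines for memory: {machines_for_memory}")
print(f"machines for throughput: {machines_for_throughput}")
```

Under these assumptions the machine count is driven by memory rather than by lookup rate: roughly 91 machines to hold the keys, against about a dozen to serve the queries.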

You are talking about very big numbers (90 TB, 1M per second). Are you sure about this???
