
Update on update:

Solved! See this: MongoDB: cannot iterate through all data with cursor (because data is corrupted)

It was caused by a corrupted data set, not by MongoDB or the driver.

=========================================================================

I'm using the latest MongoDB Java driver (2.11.3) against MongoDB 2.4.6. I've got a collection with ~250M records and I want to use a cursor to iterate through all of them. However, after about 10 minutes I get either a false cursor.hasNext() or an exception saying that the cursor does not exist on the server.

After that I learned about cursor timeouts and wrapped my cursor.next() in a try/catch. If an exception is thrown, or hasNext() returns false before all records have been iterated, the program closes the cursor, allocates a new one, and skip()s right back to where it left off.
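A minimal sketch of that retry-and-skip pattern (the host, database and collection names are placeholders, processing is elided, and error handling is simplified):

```java
import com.mongodb.*;
import java.net.UnknownHostException;

public class CursorRetrySkip {
    public static void main(String[] args) throws UnknownHostException {
        // Placeholder host/db/collection names.
        MongoClient client = new MongoClient("localhost", 27017);
        DBCollection coll = client.getDB("mydb").getCollection("tweets");

        long total = coll.count();
        long processed = 0;
        while (processed < total) {
            // Reopen a cursor and skip back to where we left off.
            DBCursor cursor = coll.find().skip((int) processed);
            try {
                while (cursor.hasNext()) {
                    DBObject doc = cursor.next();
                    // ... process doc ...
                    processed++;
                }
            } catch (MongoException e) {
                // Cursor died on the server (timed out / "cursor not found"):
                // loop around and start a new cursor at 'processed'.
            } finally {
                cursor.close();
            }
        }
        client.close();
    }
}
```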

But later on I read about cursor.skip() performance issues... and the program had only reached ~20M records when cursor.next() after cursor.skip() threw a java.util.NoSuchElementException. I believe that's because the skip operation timed out, which invalidated the cursor.

Yes, I've read about the skip() performance issues and the cursor timeout issues... but now I seem to be in a dilemma where fixing one breaks the other.

So, is there a way to gracefully iterate through all the data in a huge dataset?

@mnemosyn has already pointed out that I have to rely on range-based queries. But the problem is that I want to split the data into 16 parts and process them on different machines, and the data is not uniformly distributed over any monotonic key space. If load balancing is desired, there must be a way to work out how many keys fall in a particular range and balance them. My goal is to partition the data into 16 parts, so I have to find the quartiles of quartiles of the keys (I believe these cut points are called 16-quantiles) and use them to split the data.

Is there a way to achieve this?
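One way I can think of to estimate such boundary keys (just a sketch: the collection name is a placeholder, and the sort + skip lookups are slow, but they would only need to run once):

```java
import com.mongodb.*;
import org.bson.types.ObjectId;
import java.util.ArrayList;
import java.util.List;

public class PartitionBoundaries {
    // Sketch: estimate the 15 _id values that split the collection into 16
    // roughly equal ranges, by sorting on _id and reading the document at
    // every (total * i / 16)-th position.
    static List<ObjectId> findBoundaries(DBCollection coll) {
        long total = coll.count();
        List<ObjectId> boundaries = new ArrayList<ObjectId>();
        for (int i = 1; i < 16; i++) {
            DBObject doc = coll.find()
                    .sort(new BasicDBObject("_id", 1))
                    .skip((int) (total * i / 16))
                    .limit(1)
                    .next();
            boundaries.add((ObjectId) doc.get("_id"));
        }
        return boundaries;
    }
}
```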

I do have some ideas for once the partition boundary keys are obtained: if the new cursor times out again, I can simply record the latest tweetID and resume with a new range starting from it. But the range query has to be fast enough, or I'll still hit timeouts. I'm not confident about this...
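Roughly what I have in mind (a sketch only; lowerBound/upperBound would be two adjacent boundary keys from above, and the names are placeholders):

```java
import com.mongodb.*;
import org.bson.types.ObjectId;

public class PartitionWalker {
    // Sketch of the resume idea: walk one partition (lowerBound, upperBound)
    // in ascending _id order and remember the last _id processed. If the
    // cursor dies, the caller calls this again with the returned _id as the
    // new lowerBound.
    static ObjectId walkPartition(DBCollection coll, ObjectId lowerBound, ObjectId upperBound) {
        ObjectId lastSeen = lowerBound;
        DBObject range = new BasicDBObject("_id",
                new BasicDBObject("$gt", lowerBound).append("$lt", upperBound));
        DBCursor cursor = coll.find(range).sort(new BasicDBObject("_id", 1));
        try {
            while (cursor.hasNext()) {
                DBObject doc = cursor.next();
                // ... process doc ...
                lastSeen = (ObjectId) doc.get("_id");
            }
        } catch (MongoException e) {
            // Cursor timed out or was killed: fall through and let the
            // caller resume from lastSeen.
        } finally {
            cursor.close();
        }
        return lastSeen;
    }
}
```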

Update:

Problem solved! I didn't realise that I don't have to partition the data into contiguous chunks at all; a round-robin job dispatcher will do. See the comments on the accepted answer.
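For completeness, a sketch of the round-robin idea: one reader walks the collection (e.g. with the range walk from the answer below) and deals each batch to the next worker in turn. This is an in-process simplification; in my case the workers are separate machines, and the worker count and queue size here are arbitrary.

```java
import com.mongodb.DBObject;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class RoundRobinDispatcher {
    // Sketch: instead of pre-partitioning the key space, deal batches of
    // documents to N worker queues in round-robin order.
    private final List<BlockingQueue<List<DBObject>>> queues;
    private int next = 0;

    RoundRobinDispatcher(int workers) {
        queues = new ArrayList<BlockingQueue<List<DBObject>>>(workers);
        for (int i = 0; i < workers; i++) {
            queues.add(new ArrayBlockingQueue<List<DBObject>>(4));
        }
    }

    // Called by the single reader thread for every batch it fetches.
    void dispatch(List<DBObject> batch) throws InterruptedException {
        queues.get(next).put(batch);        // blocks if that worker is behind
        next = (next + 1) % queues.size();  // round-robin to the next worker
    }

    // Each worker repeatedly takes batches from its own queue.
    BlockingQueue<List<DBObject>> queueFor(int worker) {
        return queues.get(worker);
    }
}
```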


1 Answer


Generally, yes. If you have a monotonic field, ideally an indexed one, you can simply walk along it. For instance, if you're using a field of type ObjectId as the primary key, or if you have a CreatedDate or the like, you can simply fetch a fixed number of elements with a query, then query again using $lt with the smallest _id or CreatedDate encountered in the previous batch.

Be careful about strictly vs. non-strictly monotonic behaviour: if your key is not strictly monotonic, you may have to use $lte instead. Since the _id field is unique, ObjectIds are always strictly monotonic.
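A minimal sketch of this walk with the legacy Java driver, using _id and $lt as described (the collection name and batch size are placeholders):

```java
import com.mongodb.*;
import org.bson.types.ObjectId;
import java.util.List;

public class IdWalk {
    // Sketch: fetch a fixed-size batch sorted descending on _id, then
    // re-query with $lt on the smallest _id seen so far, until no
    // documents are left.
    static void walk(DBCollection coll) {
        final int BATCH = 10000;
        ObjectId lastSeen = null;
        while (true) {
            DBObject query = (lastSeen == null)
                    ? new BasicDBObject()
                    : new BasicDBObject("_id", new BasicDBObject("$lt", lastSeen));
            List<DBObject> batch = coll.find(query)
                    .sort(new BasicDBObject("_id", -1))
                    .limit(BATCH)
                    .toArray();
            if (batch.isEmpty()) break;
            for (DBObject doc : batch) {
                // ... process doc ...
                lastSeen = (ObjectId) doc.get("_id");
            }
        }
    }
}
```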

If you don't have such a key, things get a bit trickier. You can still iterate "along an index" (whatever that index is: a name, a hash, a UUID/GUID, etc.). That works just as well, but it's hard to get a snapshot, because you can never tell whether a result you just found was inserted before you started the traversal, and documents inserted after the traversal starts may be missed.

Answered 2013-10-28T09:31:48.450