Update on update:
Solved! See this: MongoDB: cannot iterate through all data with cursor (because data is corrupted)
It was caused by a corrupted data set, not by MongoDB or the driver.
=========================================================================
I'm using the latest Java driver (2.11.3) for MongoDB (2.4.6). I have a collection with ~250M records and I want to use a cursor to iterate through all of them. However, after 10 minutes or so I get either a false cursor.hasNext() or an exception saying that the cursor no longer exists on the server.
After that I learned about cursor timeouts and wrapped my cursor.next() in a try/catch. If an exception is thrown, or hasNext() returns false before all the records have been iterated, the program closes the cursor, allocates a new one, and skip()s right back to where it left off.
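Roughly, the retry logic looks like this (a simplified sketch; the host, database, and collection names are placeholders):

```java
import com.mongodb.*;

public class SkipResumeIteration {
    public static void main(String[] args) throws Exception {
        MongoClient mongo = new MongoClient("localhost", 27017);          // placeholder host
        DBCollection coll = mongo.getDB("mydb").getCollection("tweets");  // placeholder names

        long total = coll.count();
        long processed = 0;
        while (processed < total) {
            // skip() has to walk past 'processed' documents server-side,
            // so reopening gets slower and slower as 'processed' grows
            DBCursor cursor = coll.find().skip((int) processed);
            try {
                while (cursor.hasNext()) {
                    DBObject doc = cursor.next();
                    // ... process doc ...
                    processed++;
                }
                // if hasNext() went false early, the outer loop reopens the cursor
            } catch (MongoException e) {
                // cursor timed out / no longer exists on the server; reopen and skip back
                System.err.println("Cursor lost after " + processed + " docs: " + e.getMessage());
            } finally {
                cursor.close();
            }
        }
        mongo.close();
    }
}
```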
But later on I read about cursor.skip() performance issues... and sure enough, once the program reached ~20M records, cursor.next() after cursor.skip() threw a java.util.NoSuchElementException. I believe that's because the skip operation timed out, which invalidated the cursor.
Yes, I've read about the skip() performance issues and the cursor timeout issues... but now I'm stuck in a dilemma where fixing one breaks the other.
So, is there a way to gracefully iterate through all the data in a huge dataset?
@mnemosyn has already pointed out that I have to rely on range-based queries. But the problem is that I want to split the data into 16 parts and process them on different machines, and the data is not uniformly distributed over any monotonic key space. If the load is to be balanced, there must be a way to count how many documents fall in a given range and balance the ranges accordingly. My goal is to partition the data into 16 parts, so I have to find the quartiles of quartiles (i.e., the 16-quantiles) of the keys and use them as split points.
Is there a way to achieve this?
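One way I can think of to get approximately balanced boundaries (a sketch, not tested at this scale; it assumes tweetID is the partition key, that it is indexed so the sorted query doesn't hit the in-memory sort limit, and that one expensive skip per boundary is acceptable as a one-time cost):

```java
import com.mongodb.*;
import java.util.ArrayList;
import java.util.List;

public class PartitionBoundaries {
    // Fetch the key value at every count/parts-th position so the resulting
    // ranges hold roughly equal numbers of documents.
    // e.g. findBoundaries(coll, "tweetID", 16) returns 15 split points.
    static List<Object> findBoundaries(DBCollection coll, String keyField, int parts) {
        long count = coll.count();
        List<Object> boundaries = new ArrayList<Object>();
        for (int i = 1; i < parts; i++) {
            long position = count * i / parts;   // well under Integer.MAX_VALUE for ~250M docs
            DBCursor c = coll.find(new BasicDBObject(), new BasicDBObject(keyField, 1))
                             .sort(new BasicDBObject(keyField, 1))
                             .skip((int) position)
                             .limit(1);
            try {
                if (c.hasNext()) {
                    boundaries.add(c.next().get(keyField));
                }
            } finally {
                c.close();
            }
        }
        return boundaries;   // parts - 1 split points => parts ranges
    }
}
```

With 15 split points this yields 16 ranges holding roughly count/16 documents each.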
I do have some ideas for after the first seek has been done and the partition boundary keys are known: if the new cursor times out again, I can simply record the latest tweetID and resume with a new range starting just past it. However, that range query has to be fast enough, or I'll still get timeouts. I'm not confident about this...
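What I have in mind for each partition is roughly this (again a sketch; tweetID and the boundary values are whatever the previous step produced):

```java
import com.mongodb.*;

public class RangeIteration {
    // Iterate over one partition [lo, hi) by tweetID, resuming with $gt on the
    // last key seen whenever the cursor dies. Assumes tweetID is indexed.
    static void processRange(DBCollection coll, Object lo, Object hi) {
        Object lastSeen = null;
        boolean finished = false;
        while (!finished) {
            BasicDBObject range = new BasicDBObject("$lt", hi);
            if (lastSeen == null) {
                range.append("$gte", lo);        // first pass: start at the partition boundary
            } else {
                range.append("$gt", lastSeen);   // resume just past the last document processed
            }
            DBCursor cursor = coll.find(new BasicDBObject("tweetID", range))
                                  .sort(new BasicDBObject("tweetID", 1));
            try {
                while (cursor.hasNext()) {
                    DBObject doc = cursor.next();
                    lastSeen = doc.get("tweetID");
                    // ... process doc ...
                }
                finished = true;   // exhausted the range without losing the cursor
            } catch (MongoException e) {
                // cursor timed out; loop around and reopen from lastSeen
                System.err.println("Resuming range after " + lastSeen + ": " + e.getMessage());
            } finally {
                cursor.close();
            }
        }
    }
}
```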
Update:
Problem solved! I didn't realise that I don't have to partition the data into contiguous chunks at all; a round-robin job dispatcher will do. See the comments on the accepted answer.
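For completeness, the round-robin idea boils down to something like this (my own simplified sketch, not the exact code from the answer; host/collection names are placeholders, and a real dispatcher would drain the queues before shutting down):

```java
import com.mongodb.*;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class RoundRobinDispatch {
    public static void main(String[] args) throws Exception {
        final int workers = 16;
        MongoClient mongo = new MongoClient("localhost", 27017);          // placeholder host
        DBCollection coll = mongo.getDB("mydb").getCollection("tweets");  // placeholder names

        // One bounded queue per worker; the reader blocks when a worker falls behind.
        List<BlockingQueue<DBObject>> queues = new ArrayList<BlockingQueue<DBObject>>();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            final BlockingQueue<DBObject> q = new ArrayBlockingQueue<DBObject>(1000);
            queues.add(q);
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            DBObject doc = q.take();
                            // ... process doc ...
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }

        // Single reader: no partitioning needed, just deal documents out in turn.
        // QUERYOPTION_NOTIMEOUT keeps the server from reaping the cursor while workers catch up.
        DBCursor cursor = coll.find().addOption(Bytes.QUERYOPTION_NOTIMEOUT);
        long i = 0;
        while (cursor.hasNext()) {
            queues.get((int) (i++ % workers)).put(cursor.next());
        }
        cursor.close();
        pool.shutdownNow();   // simplification: real code should drain the queues first
        mongo.close();
    }
}
```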