I've been reading up on MongoDB, and I am particularly interested in the aggregation framework's capabilities. I am looking at taking multiple datasets, each consisting of 10+ million rows per month, and creating aggregations off of this data. This is time series data.
Example: using Oracle OLAP, you can load data at the second/minute level and have it roll up to hours, days, weeks, months, quarters, years, etc. You simply define your dimensions and go from there. This works quite well.
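For concreteness, here is roughly how I picture the raw data landing in MongoDB. The database, collection, and field names (`metrics`, `events`, `ts`, `sensor`, `value`) are just my own placeholders, not anything prescribed:

```python
from datetime import datetime, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["metrics"]  # hypothetical database name

# One document per raw observation, at second/minute granularity.
db.events.insert_one({
    "ts": datetime(2012, 6, 1, 14, 30, 15, tzinfo=timezone.utc),  # observation time
    "sensor": "sensor-42",  # the dimension I would roll up by
    "value": 3.7,           # the measure to aggregate
})
```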
So far I have read that MongoDB can handle the above using its map-reduce functionality. Map-reduce can be run so that it merges new output into an existing results collection incrementally. This makes sense, since I would be loading new data weekly or monthly and would expect to only have to process the newly loaded data. A sketch of my understanding is below.
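From what I've read, the incremental part works by writing map-reduce output back into an existing collection with the `reduce` output mode, so that re-reduce folds new results into the stored ones. A minimal sketch of what I think that looks like from Python, reusing the placeholder names above (the weekly `query` window and the `hourly_rollup` output collection are also my own inventions):

```python
from datetime import datetime, timezone

from bson.code import Code
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["metrics"]

# Emit one (sensor, hour) key per event, with the value shaped the
# same way the reduce function's return value is shaped.
mapper = Code("""
function () {
    var hour = new Date(this.ts);
    hour.setMinutes(0); hour.setSeconds(0); hour.setMilliseconds(0);
    emit({sensor: this.sensor, hour: hour}, {total: this.value, count: 1});
}
""")

# Reduce must be safe to re-run: it merges fresh map output with the
# totals already stored in the output collection.
reducer = Code("""
function (key, values) {
    var out = {total: 0, count: 0};
    values.forEach(function (v) { out.total += v.total; out.count += v.count; });
    return out;
}
""")

# out: {reduce: ...} re-reduces new results against existing documents,
# so only the newly loaded slice of data needs to be scanned.
db.command(
    "mapReduce",
    "events",
    map=mapper,
    reduce=reducer,
    query={"ts": {"$gte": datetime(2012, 6, 1, tzinfo=timezone.utc)}},
    out={"reduce": "hourly_rollup"},
)
```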
I have also read that map-reduce in MongoDB can be slow. To overcome this, the idea is to use cheap commodity hardware and spread the load across multiple machines, i.e. sharding.
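If it helps clarify what I mean by spreading the load, I am picturing something like sharding the raw collection. This assumes a sharded cluster is already running behind a `mongos` router, and the names are again just my placeholders:

```python
from pymongo import MongoClient

# Connect to the mongos router of an existing sharded cluster.
client = MongoClient("mongodb://mongos-host:27017")

# Enable sharding for the database, then shard the raw events
# collection on a compound key so time-series inserts spread out
# across shards instead of piling onto one machine.
client.admin.command("enableSharding", "metrics")
client.admin.command("shardCollection", "metrics.events",
                     key={"sensor": 1, "ts": 1})
```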
So here are my questions:

- How well (or poorly) does MongoDB perform on map-reduce jobs? Do you really need a lot of machines to get acceptable performance?
- In terms of workflow, is it relatively easy to store and merge the incremental results generated by map-reduce?
- How much of a performance improvement does the aggregation framework offer over map-reduce?
- Does the aggregation framework offer a way to store results incrementally, similar to what the existing map-reduce functionality does? (See the pipeline sketch after this list for what I have in mind.)
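To make that last question concrete, here is the kind of pipeline I have in mind for an hourly rollup, again using my placeholder names. The `$group` on date-part operators is my guess at the idiomatic way to bucket by time; what I cannot tell is how to persist this output back into a rollup collection incrementally the way `out: {reduce: ...}` does:

```python
from datetime import datetime, timezone

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["metrics"]

pipeline = [
    # Mirror the incremental idea: only scan the newly loaded window.
    {"$match": {"ts": {"$gte": datetime(2012, 6, 1, tzinfo=timezone.utc)}}},
    # Bucket each event into its hour using the date-part operators.
    {"$group": {
        "_id": {
            "sensor": "$sensor",
            "year": {"$year": "$ts"},
            "month": {"$month": "$ts"},
            "day": {"$dayOfMonth": "$ts"},
            "hour": {"$hour": "$ts"},
        },
        "total": {"$sum": "$value"},
        "count": {"$sum": 1},
    }},
]

for doc in db.events.aggregate(pipeline):
    print(doc)
```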
I appreciate your responses in advance!