I was going through some documentations related to HDFS architecture and Apache crunch PTable. Based on my understandings, when we generate PTable the data is internally stored across the Data nodes in HDFS.
This means, if I have PTable with <K1,V1>,<K2,V2>,<K1,V3>,<K3,V4>,<K2,V5>
and two Data nodes D1 and D2 in HDFS.
Let's say each data node has a capacity to hold 3 pairs. So D1 will hold <K1,V1>,<K2,V2>,<K1,V3>
and D2 will hold <K3,V4>,<K2,V5>
.
If I do collectValues on this PTable, I am internally running another map-reduce job to get these values from PTable and generate pairs of <K,Collection<V>>
. So at the end I will have, <K1,Collection<V1,V3>>, <K2,Collection<V2,V5>> and <K3,Collection<V4>>
. And again these pairs will be distributed to different data nodes.
Now, I have this doubt that how will the Collection values (V1,V3 of K1)
be stored in the generated PTable? Will this data be distributed across the nodes too, i.e., will
V1 be stored in D1
V3 be stored in D2
or, V1 and V3 will stored in one node only.
If all the collection values for a key are stored in one node (not-distributed), then for large data sets, won't the processing on the collected values of each key become slow?