hadoop - How does Apache Crunch PTable collectValues work internally

翻译自：https://stackoverflow.com/questions/36890242 2016-04-27T12:45:05.297

207 次

I was going through some documentations related to HDFS architecture and Apache crunch PTable. Based on my understandings, when we generate PTable the data is internally stored across the Data nodes in HDFS.

This means, if I have PTable with <K1,V1>,<K2,V2>,<K1,V3>,<K3,V4>,<K2,V5> and two Data nodes D1 and D2 in HDFS. Let's say each data node has a capacity to hold 3 pairs. So D1 will hold <K1,V1>,<K2,V2>,<K1,V3> and D2 will hold <K3,V4>,<K2,V5>.

If I do collectValues on this PTable, I am internally running another map-reduce job to get these values from PTable and generate pairs of <K,Collection<V>>. So at the end I will have, <K1,Collection<V1,V3>>, <K2,Collection<V2,V5>> and <K3,Collection<V4>>. And again these pairs will be distributed to different data nodes.

Now, I have this doubt that how will the Collection values (V1,V3 of K1) be stored in the generated PTable? Will this data be distributed across the nodes too, i.e., will

V1 be stored in D1
V3 be stored in D2

or, V1 and V3 will stored in one node only.

If all the collection values for a key are stored in one node (not-distributed), then for large data sets, won't the processing on the collected values of each key become slow?

hadoop - How does Apache Crunch PTable collectValues work internally

1 に答える 1

Related

Reference