When creating a new EMR job with an S3 bucket as the input source, is the data automatically copied from S3 into HDFS on the nodes? Or does the data remain solely in S3 and get read only when needed by MapReduce jobs?
I get the impression it's the latter; but if the data is stored in S3 and the processing is done on provisioned EC2 instances, doesn't this go against a fundamental principle of MapReduce: doing the processing local to the data, as opposed to a more traditional system that moves the data to where the processing is?
What are the practical implications of this approach for a reasonably large data set, say 1 PB? For example, does the cluster take longer to start?