java - EMR streaming job using Java code for the mapper and reducer

Question

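I am currently running streaming jobs with mapper and reducer code written in Ruby, and I want to convert these to Java. I do not know how to run streaming jobs on EMR Hadoop using Java. The CloudBurst sample on Amazon's EMR website is too complex. Below are the details of how I currently run the jobs.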

Code to start the job:

    elastic-mapreduce --create --alive --plain-output --master-instance-type m1.small \
        --slave-instance-type m1.xlarge --num-instances 2 --name "Job Name" \
        --bootstrap-action s3://bucket-path/bootstrap.sh

Code to add a step:

    elastic-mapreduce -j <job_id> --stream --step-name "my_step_name" \
        --jobconf mapred.task.timeout=0 --mapper s3://bucket-path/mapper.rb \
        --reducer s3://bucket-path/reducerRules.rb --cache s3://bucket-path/cache/cache.txt \
        --input s3://bucket-path/input --output s3://bucket-path/output

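The mapper code reads from the csv file passed above as the EMR cache argument, reads from the input S3 bucket (which also contains some csv files), does some calculations, and prints csv output lines to standard output.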

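# mapper.rb
require 'rubygems'
require 'fastercsv'

CSV_OPTIONS = {
  # some CSV options
}

# Read the cached file that EMR places in the working directory
begin
    file = File.open("cache.txt")
    while (line = file.gets)
        # do something
    end
    file.close
end

# Read the input csv rows from standard input
input = FasterCSV.new(STDIN, CSV_OPTIONS)
input.each { |row|
    # do calculations and get result
    puts(result)
}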

# reducer.rb

$stdin.each_line do |line|
    # do some aggregations and get aggregation_result
    puts(aggregation_result) if some_condition
end

score 0 · Accepted Answer

If you are using Java, you don't use streaming. You build a Jar directly against the MapReduce API.

Check the examples folder of the Hadoop source for some good examples of how to do this, including the infamous word count: https://github.com/apache/hadoop/tree/trunk/src/examples/org/apache/hadoop/examples
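
For readers who have not seen word count before, the flow it follows can be sketched in plain Java with no Hadoop dependency. This is only an illustration of the map, group-by-key, reduce pattern, not the code from the Hadoop examples folder; the class name `WordCountSketch` and its methods are hypothetical. In a real job the framework does the grouping step, and the map and reduce logic live in `Mapper` and `Reducer` subclasses.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, Hadoop-free illustration of the word count flow.
public class WordCountSketch {

    // "Map" phase: emit one (word, 1) pair per whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // "Reduce" phase: sum every count emitted for a single word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Driver: groups mapper output by key (the framework's job in Hadoop),
    // then reduces each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {a=2, b=2, c=1}
        System.out.println(run(Arrays.asList("a b a", "b c")));
    }
}
```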

I'm not entirely sure why you want to use Java, but coding directly against the APIs can be painful. You might want to try one of the following.

Java projects:

Cascading http://www.cascading.org/
Crunch http://www.cloudera.com/blog/2011/10/introducing-crunch/

Non-Java:

Hive (SQL-like) https://cwiki.apache.org/confluence/display/Hive/Home
Pig http://pig.apache.org/#Getting+Started
Scoobi (Scala) https://github.com/NICTA/scoobi

FWIW, Pig would probably be my choice, and I believe it is supported out of the box on EMR.

score 0 · Accepted Answer

Now that I have a better grasp of Hadoop and MapReduce, here is what I was looking for.

To start the cluster, the code remains more or less the same as in the question, but we can add configuration parameters:

ruby elastic-mapreduce --create --alive --plain-output --master-instance-type m1.xlarge --slave-instance-type m1.xlarge --num-instances 11  --name "Java Pipeline" --bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "--mapred-config-file, s3://com.versata.emr/conf/mapred-site-tuned.xml"

To add the job steps:

Step 1:

ruby elastic-mapreduce --jobflow <jobflow_id> --jar s3://somepath/job-one.jar --arg s3://somepath/input-one --arg s3://somepath/output-one --args -m,mapred.min.split.size=52880 -m,mapred.task.timeout=0

Step 2:

ruby elastic-mapreduce --jobflow <jobflow_id> --jar s3://somepath/job-two.jar --arg s3://somepath/output-one --arg s3://somepath/output-two --args -m,mapred.min.split.size=52880 -m,mapred.task.timeout=0

As for the Java code, there is one main class that contains one implementation of each of the following classes:

org.apache.hadoop.mapreduce.Mapper;
org.apache.hadoop.mapreduce.Reducer;

Each of these has to override map() and reduce() respectively to do the desired work.

The Java class for the problem in the question looks like this:

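import java.io.File;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SomeJob extends Configured implements Tool {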

    private static final String JOB_NAME = "My Job";

    /**
     * This is Mapper.
     */
    public static class MapJob extends Mapper<LongWritable, Text, Text, Text> {

        private Text outputKey = new Text();
        private Text outputValue = new Text();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {

            // Get the cached file
            Path file = DistributedCache.getLocalCacheFiles(context.getConfiguration())[0];

            File fileObject = new File (file.toString());
            // Do whatever required with file data
        }

        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            outputKey.set("Some key calculated or derived");
            outputValue.set("Some Value calculated or derived");
            context.write(outputKey, outputValue);
        }
    }

    /**
     * This is Reducer.
     */
    public static class ReduceJob extends Reducer<Text, Text, Text, Text> {

        private Text outputKey = new Text();
        private Text outputValue = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException,
                InterruptedException {
            outputKey.set("Some key calculated or derived");
            outputValue.set("Some Value calculated or derived");
            context.write(outputKey, outputValue);
        }
    }

    @Override
    public int run(String[] args) throws Exception {

        try {
            Configuration conf = getConf();
            DistributedCache.addCacheFile(new URI(args[2]), conf);
            Job job = new Job(conf);

            job.setJarByClass(SomeJob.class);
            job.setJobName(JOB_NAME);

            job.setMapperClass(MapJob.class);
            job.setReducerClass(ReduceJob.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            FileInputFormat.setInputPaths(job, args[0]);
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            boolean success = job.waitForCompletion(true);
            return success ? 0 : 1;
        } catch (Exception e) {
            e.printStackTrace();
            return 1;
        }

    }

    public static void main(String[] args) throws Exception {

        if (args.length < 3) {
            System.out
                    .println("Usage: SomeJob <comma separated list of input directories> <output dir> <cache file>");
            System.exit(-1);
        }

        int result = ToolRunner.run(new SomeJob(), args);
        System.exit(result);
    }

}
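
The `setup()` method above leaves the handling of the cached file open ("Do whatever required with file data"). As a minimal sketch, assuming the cache file holds two-column `key,value` lines (a hypothetical format; the question does not describe the file's layout), it could be loaded into an in-memory lookup table with plain Java. The class name `CacheFileLoader` and the method `load` are illustrative, not part of Hadoop.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: turns "key,value" lines into a lookup table.
public class CacheFileLoader {

    // Reads the source line by line; blank or malformed lines are skipped.
    static Map<String, String> load(Reader source) throws IOException {
        Map<String, String> lookup = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Split on the first comma only, so values may contain commas.
                String[] parts = line.split(",", 2);
                if (parts.length == 2 && !parts[0].trim().isEmpty()) {
                    lookup.put(parts[0].trim(), parts[1].trim());
                }
            }
        }
        return lookup;
    }

    public static void main(String[] args) throws IOException {
        // Assumed sample content standing in for the real cache file.
        Map<String, String> lookup = load(new StringReader("id1,alpha\n\nid2,beta\n"));
        // prints "alpha beta"
        System.out.println(lookup.get("id1") + " " + lookup.get("id2"));
    }
}
```

Inside `setup()`, the same method could be called as `load(new FileReader(fileObject))` and the resulting map kept in a field so that `map()` can consult it for each record.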
