hadoop - Hadoop: Reducer writes the mapper output into the output file

Question

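I ran into a very strange problem. The reducers do run, but when I checked the output files I found only the output from the mappers. While trying to debug it, I reproduced the same problem with the word count sample after changing the mapper's output value type from LongWritable to Text.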

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;

public class WordCount extends Configured implements Tool {

   public static class Map
       extends Mapper<LongWritable, Text, Text, Text> {
     private final static IntWritable one = new IntWritable(1);
     private Text word = new Text();

     public void map(LongWritable key, Text wtf, Context context)
         throws IOException, InterruptedException {
       String line = wtf.toString();
       StringTokenizer tokenizer = new StringTokenizer(line);
       while (tokenizer.hasMoreTokens()) {
         word.set(tokenizer.nextToken());
         context.write(word, new Text("frommapper"));
       }
     }
   }

   public static class Reduce
       extends Reducer<Text, Text, Text, Text> {
     public void reduce(Text key, Text wtfs,
         Context context) throws IOException, InterruptedException {
/*
       int sum = 0;
       for (IntWritable val : wtfs) {
         sum += val.get();
       }
       context.write(key, new IntWritable(sum));*/
       context.write(key, new Text("can't output"));
     }
   }

   public int run(String [] args) throws Exception {
     Job job = new Job(getConf());
     job.setJarByClass(WordCount.class);
     job.setJobName("wordcount");


     job.setOutputKeyClass(Text.class);
     job.setMapOutputValueClass(Text.class);
     job.setOutputValueClass(Text.class);
     job.setMapperClass(Map.class);
     //job.setCombinerClass(Reduce.class);
     job.setReducerClass(Reduce.class);

     job.setInputFormatClass(TextInputFormat.class);
     job.setOutputFormatClass(TextOutputFormat.class);

     FileInputFormat.setInputPaths(job, new Path(args[0]));
     FileOutputFormat.setOutputPath(job, new Path(args[1]));

     boolean success = job.waitForCompletion(true);
     return success ? 0 : 1;
   }

   public static void main(String[] args) throws Exception {
     int ret = ToolRunner.run(new WordCount(), args);
     System.exit(ret);
   }
}

Here are the results:

JobClient:     Combine output records=0
12/06/13 17:37:46 INFO mapred.JobClient:     Map input records=7
12/06/13 17:37:46 INFO mapred.JobClient:     Reduce shuffle bytes=116
12/06/13 17:37:46 INFO mapred.JobClient:     Reduce output records=7
12/06/13 17:37:46 INFO mapred.JobClient:     Spilled Records=14
12/06/13 17:37:46 INFO mapred.JobClient:     Map output bytes=96
12/06/13 17:37:46 INFO mapred.JobClient:     Combine input records=0
12/06/13 17:37:46 INFO mapred.JobClient:     Map output records=7
12/06/13 17:37:46 INFO mapred.JobClient:     Reduce input records=7

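Then I found this strange result in the output file. The problem appeared after I changed the map output value type and the reducer input key type to Text, regardless of whether I also changed the reduce output value type. I was also forced to change job.setOutputValueClass(Text.class).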

a   frommapper
a   frommapper
a   frommapper
gg  frommapper
h   frommapper
sss frommapper
sss frommapper

Help!

score 4 · Accepted Answer

The arguments of your reduce function should be as follows:

public void reduce(Text key, Iterable <Text> wtfs,
     Context context) throws IOException, InterruptedException {

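With the arguments declared the way you have them, the reduce operation never receives the list of values, so the job simply outputs whatever comes from the map function. A reduce(Text, Text, Context) method does not override Reducer.reduce(Text, Iterable<Text>, Context); Hadoop keeps calling the inherited default, which writes every pair through unchanged. In the original word count, the effect is as if

sum += val.get()

only ever went from 0 to 1, because each <key, value> pair of the form <word, one> is handled individually instead of being summed per key.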
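
The signature problem can be reproduced without Hadoop. The following is a minimal sketch in plain Java; `MiniReducer`, `BrokenReducer`, `FixedReducer` and `OverloadDemo` are hypothetical names made up for illustration and are not part of the Hadoop API. `MiniReducer` only imitates the one relevant behaviour of Hadoop's `Reducer`: its default `reduce` passes every pair through unchanged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for Hadoop's Reducer (not the real API): the default
// reduce() is an identity function that emits every (key, value) unchanged.
class MiniReducer {
    final List<String> output = new ArrayList<>();

    void reduce(String key, Iterable<String> values) {
        for (String value : values) {
            output.add(key + "\t" + value);
        }
    }
}

// Second parameter is a single value, not an Iterable: this is an overload,
// not an override, so the framework-style call below never reaches it.
class BrokenReducer extends MiniReducer {
    void reduce(String key, String value) {
        output.add(key + "\tcan't output");
    }
}

// Matching signature plus @Override: the compiler checks that this method
// really replaces the inherited one.
class FixedReducer extends MiniReducer {
    @Override
    void reduce(String key, Iterable<String> values) {
        output.add(key + "\tcan't output");
    }
}

class OverloadDemo {
    // Imitates the framework, which only knows the base-class signature.
    static List<String> run(MiniReducer reducer) {
        reducer.reduce("a", Arrays.asList("frommapper", "frommapper", "frommapper"));
        return reducer.output;
    }

    public static void main(String[] args) {
        System.out.println(run(new BrokenReducer())); // three pass-through lines
        System.out.println(run(new FixedReducer()));  // one line from the override
    }
}
```

Adding `@Override` to `BrokenReducer.reduce` would turn the silent mistake into a compile error, which is why annotating `map` and `reduce` is a cheap safeguard.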

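Also, mapper functions do not normally write to the output file (I have never heard of it, though I don't know whether it is possible). It is normally always the reducer that writes to the output file. Mapper output is intermediate data that Hadoop handles transparently. So if you see something in the output file, it must be the reducer's output, not the mapper's. If you want to verify this, you can go to the logs of the job you ran and check what is happening in each mapper and reducer individually.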
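
To make the "intermediate data" point concrete, here is a rough plain-Java simulation of what happens between map and reduce once `reduce` is declared with an `Iterable`. It is a sketch under stated assumptions, not Hadoop code: `ShuffleSketch`, `mapAndShuffle` and `reduceAll` are hypothetical names, and the input words are simply the seven keys that appear in the output file above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation of map -> shuffle -> reduce; not the Hadoop API.
class ShuffleSketch {
    // Map phase plus shuffle: emit (word, 1) for every word, then group the
    // emitted values by key. TreeMap keeps keys sorted, as the shuffle does.
    static Map<String, List<Integer>> mapAndShuffle(List<String> words) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : words) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // Reduce phase: one call per key, receiving all values for that key.
    static Map<String, Integer> reduceAll(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int one : entry.getValue()) {
                sum += one;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "gg", "a", "sss", "h", "a", "sss");
        System.out.println(reduceAll(mapAndShuffle(words)));
        // prints {a=3, gg=1, h=1, sss=2}
    }
}
```

Only the result of `reduceAll` corresponds to what lands in the output file; the grouped map in between plays the role of the intermediate data that the framework manages.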

I hope this clears up a few things.
