hadoop - How can I emit a key/value at the end of processing a whole file?

Question

The mapper reads lines from a file... How can I emit a key/value once at the end, after the whole file has been scanned, rather than for each line?

score 2 · Accepted Answer

With the new mapreduce API you can override the Mapper.cleanup(Context) method and use Context.write(K, V) there just as you normally would in the map method.

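@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
  // Called once, after the last call to map() for this task
  context.write(new Text("key"), new Text("value"));
}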

With the old mapred API you can override the close() method, but you need to store a reference to the OutputCollector that is passed to the map method.

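private OutputCollector cachedCollector = null;

public void map(LongWritable key, Text value, OutputCollector outputCollector, Reporter reporter) throws IOException {
  if (cachedCollector == null) {
    // Keep the collector so that close() can still emit after the last record
    cachedCollector = outputCollector;
  }

  // ...
}

@Override
public void close() throws IOException {
  // cachedCollector is still null if map() was never called (empty input)
  if (cachedCollector != null) {
    cachedCollector.collect(outputKey, outputValue);
  }
}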
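
Both variants above follow the same pattern: map() only updates state, and the single key/value is written once everything has been read. Here is a minimal plain-Java sketch of that pattern, with no Hadoop types so it runs on its own. The class LineCountingMapper, its emitted list (standing in for context.write) and the run helper are all hypothetical names made up for illustration, not part of any Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a mapper; it only mimics the call order.
class LineCountingMapper {
    private long lineCount = 0;

    // Stand-in for the pairs a real job would pass to context.write().
    final List<String[]> emitted = new ArrayList<>();

    // Called once per input line: update state, emit nothing.
    void map(String line) {
        lineCount++;
    }

    // Called once after the last line: emit the single summary pair.
    void cleanup() {
        emitted.add(new String[] {"lineCount", Long.toString(lineCount)});
    }

    // Drives the mapper in the same order as above: every map() call, then cleanup().
    static List<String[]> run(List<String> lines) {
        LineCountingMapper mapper = new LineCountingMapper();
        for (String line : lines) {
            mapper.map(line);
        }
        mapper.cleanup();
        return mapper.emitted;
    }

    public static void main(String[] args) {
        for (String[] pair : run(Arrays.asList("a", "b", "c"))) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

For the three-line input in main, exactly one pair is emitted: key lineCount with value 3. In a real job the counter would be a field of your Mapper subclass and the emit would be the context.write (or collector.collect) call shown above.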

score 0 · Accepted Answer

As an alternative to Chris's answer, you can achieve this by overriding the run() method of the Mapper class (new API).

public static class Map extends Mapper<IntWritable, IntWritable, IntWritable, IntWritable> {

  //map method here

  // Override run(), which drives the whole task
  @Override
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    // Emit your last <key, value> here, after every record has been mapped
    context.write(lastOutputKey, lastOutputValue);
    cleanup(context);
  }
}

And to make sure that each mapper processes exactly one file, you need to create your own version of FileInputFormat and override isSplitable(), like this:

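// FileInputFormat is abstract: either implement createRecordReader() as well,
// or extend a concrete format such as TextInputFormat instead.
public abstract class NonSplittableFileInputFormat<K, V> extends FileInputFormat<K, V> {

  // New-API (org.apache.hadoop.mapreduce) signature; the old mapred API
  // takes (FileSystem fs, Path filename) instead.
  @Override
  protected boolean isSplitable(JobContext context, Path filename) {
    return false;
  }
}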

score 0 · Accepted Answer

Do you have a single key/value for the whole file, or several?

Case #1: Use WholeFileInputFormat. You receive the complete file contents as a single record. You can split it into records, process all of them, and emit the final key/value at the end of processing.

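Case #2: Keep using the same FileInputFormat. Store all the key/values in temporary storage. At the end, go through the temporary storage, emit the key/values you need and suppress the ones you don't.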
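
To make case #2 concrete, here is a rough plain-Java sketch of "buffer everything, decide at the end". It assumes input lines of the form key,value with integer values, keeps the highest value per key in an in-memory map as the temporary storage, and emits only the pairs that reach a threshold. BufferingMapper, the line format and the threshold rule are all assumptions made up for illustration. In a real job the buffering would sit in map() and the emit loop in cleanup().

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a mapper that buffers its output.
class BufferingMapper {
    // Temporary storage: highest value seen so far for each key.
    private final Map<String, Integer> buffer = new LinkedHashMap<>();

    // Called per "key,value" line: store the pair instead of emitting it.
    void map(String line) {
        String[] parts = line.split(",");
        String key = parts[0].trim();
        int value = Integer.parseInt(parts[1].trim());
        buffer.merge(key, value, Integer::max);
    }

    // Called once at the end: emit the wanted pairs, suppress the rest.
    List<String> cleanup(int threshold) {
        List<String> emitted = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : buffer.entrySet()) {
            if (entry.getValue() >= threshold) {
                emitted.add(entry.getKey() + "\t" + entry.getValue());
            }
        }
        return emitted;
    }
}
```

Keep in mind that the buffer lives in the mapper's heap, so this only works while the buffered key/values for one task fit in memory.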
