io - Text reader class in Hadoop

Question

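I have a directory OUTPUT that contains the output files of a MapReduce job. The output files are text files written with TextOutputFormat.

Now I want to read the key-value pairs back from those output files. How can I do this with an existing Hadoop class? One way I managed to do it is the following: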

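import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// conf (a Hadoop Configuration) and OUTPUT (the job output directory) are defined elsewhere.
FileSystem fs = FileSystem.get(conf);
FileStatus[] files = fs.globStatus(new Path(OUTPUT + "/part-*"));
for (FileStatus file : files) {
  if (file.getLen() > 0) {
    // try-with-resources closes the reader and the underlying stream, even on error.
    try (BufferedReader bin = new BufferedReader(
        new InputStreamReader(fs.open(file.getPath()), StandardCharsets.UTF_8))) {
      String s;
      while ((s = bin.readLine()) != null) {
        System.out.println(s);
      }
    }
  }
}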

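This approach works, but I have to parse the key-value pair out of each line by hand, which adds a lot of work. I am looking for something more convenient that reads the key and the value directly into variables.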

score 1 · Accepted Answer

Are you forced to use TextOutputFormat as your output format in the previous job?

If not, consider using SequenceFileOutputFormat; you can then use a SequenceFile.Reader to read the file back as key/value pairs. You can also still 'view' the file using hadoop fs -text path/to/output/part-r-00000
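Reading the pairs back with SequenceFile.Reader might look roughly like the sketch below. It assumes the Hadoop 2.x (or later) client libraries and that the previous job wrote Writable keys and values. The class name SequenceFileOutputReader and the method readPairs are made up for this example; they are not part of Hadoop.

```java
import java.io.IOException;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Hypothetical helper, not a Hadoop class.
public class SequenceFileOutputReader {

    // Reads every part-* file under outputDir and returns the pairs as strings.
    public static List<Map.Entry<String, String>> readPairs(Configuration conf, Path outputDir)
            throws IOException {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        FileSystem fs = outputDir.getFileSystem(conf);
        FileStatus[] files = fs.globStatus(new Path(outputDir, "part-*"));
        if (files == null) {
            return pairs;
        }
        for (FileStatus file : files) {
            try (SequenceFile.Reader reader =
                     new SequenceFile.Reader(conf, SequenceFile.Reader.file(file.getPath()))) {
                // The key and value classes are recorded in the file header.
                Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
                Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
                // next() refills the same two objects, so copy what you need on each pass.
                while (reader.next(key, value)) {
                    pairs.add(new SimpleEntry<>(key.toString(), value.toString()));
                }
            }
        }
        return pairs;
    }
}
```

If you know the concrete types up front (say Text keys and IntWritable values), you can construct those directly instead of going through ReflectionUtils and keep the typed objects rather than their string forms.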

EDIT: You can also use the KeyValueLineRecordReader class; you'll just need to pass a FileSplit to the constructor.
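A minimal sketch of the KeyValueLineRecordReader route, assuming the older org.apache.hadoop.mapred API (whose constructor takes a Configuration and a FileSplit) and the default tab separator that TextOutputFormat writes. The class name TextOutputReader and the method readPairs are invented for illustration only.

```java
import java.io.IOException;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.KeyValueLineRecordReader;

// Hypothetical helper, not a Hadoop class.
public class TextOutputReader {

    // Reads every non-empty part-* file under outputDir as tab-separated key/value lines.
    public static List<Map.Entry<String, String>> readPairs(Configuration conf, Path outputDir)
            throws IOException {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        FileSystem fs = outputDir.getFileSystem(conf);
        FileStatus[] files = fs.globStatus(new Path(outputDir, "part-*"));
        if (files == null) {
            return pairs;
        }
        for (FileStatus file : files) {
            if (file.getLen() == 0) {
                continue;
            }
            // One split covering the whole file, with no preferred hosts.
            FileSplit split = new FileSplit(file.getPath(), 0, file.getLen(), new String[0]);
            KeyValueLineRecordReader reader = new KeyValueLineRecordReader(conf, split);
            try {
                Text key = reader.createKey();
                Text value = reader.createValue();
                // next() reuses the same Text objects, so copy the contents out each time.
                while (reader.next(key, value)) {
                    pairs.add(new SimpleEntry<>(key.toString(), value.toString()));
                }
            } finally {
                reader.close();
            }
        }
        return pairs;
    }
}
```

Each line is split at the first separator; a line without one comes back as a key with an empty value. Both key and value arrive as Text, so a numeric value still needs something like Integer.parseInt(value.toString()).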
