hadoop - Common words across files in word count

Question

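I was able to run the Hadoop word count example in non-distributed mode. I get the output in a file named "part-00000", and I can see that it lists every word from all the input files combined.

After tracing the word count code, I can see that it takes each line and splits it into words on whitespace.

I am wondering how to list the words that occur in more than one file, along with their occurrences. Can this be achieved with Map/Reduce?

Added: are the following changes appropriate?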

      //changes in the parameters here

    public static class Map extends MapReduceBase implements Mapper<Text, Text, Text, Text> {

         // These are the original line; I am not using them but left them here...
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

                    //My changes are here too

        private Text outvalue=new Text();
        FileSplit fileSplit = (FileSplit)reporter.getInputSplit();
        private String filename = fileSplit.getPath().getName();;



      public void map(Text key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {

        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());

          //    And here        
              outvalue.set(filename);
          output.collect(word, outvalue);

        }

      }

    }

score 0 · Accepted Answer

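You can amend the mapper to output the word as the key, and a Text value holding the name of the file the word came from. Then, in the reducer, you need to dedup the file names and output only those entries where the word appears in more than one file.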
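To make that mapper/reducer split concrete, here is a minimal sketch in plain Java that imitates it outside Hadoop, so it can be run on its own. The class `CommonWordsSketch` and its method names are hypothetical and not part of any Hadoop API. In a real job, the logic of `distinctFiles` and the "more than one file" check would sit inside the reducer's `reduce` method, and the grouping by word would be done by the framework's shuffle.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical stand-alone sketch; it does not use Hadoop classes.
class CommonWordsSketch {

  // What a reducer would do for one word: collapse the incoming
  // file names into a set so each file is counted once.
  static Set<String> distinctFiles(Iterator<String> fileNames) {
    Set<String> distinct = new TreeSet<>();
    while (fileNames.hasNext()) {
      distinct.add(fileNames.next());
    }
    return distinct;
  }

  // Stand-in for map + shuffle + reduce. Input: file name -> file contents.
  static Map<String, Set<String>> commonWords(Map<String, String> fileContents) {
    // "Map" phase: emit (word, fileName); grouping by word imitates the shuffle.
    Map<String, List<String>> grouped = new TreeMap<>();
    for (Map.Entry<String, String> file : fileContents.entrySet()) {
      StringTokenizer tokenizer = new StringTokenizer(file.getValue());
      while (tokenizer.hasMoreTokens()) {
        grouped.computeIfAbsent(tokenizer.nextToken(), k -> new ArrayList<>())
               .add(file.getKey());
      }
    }
    // "Reduce" phase: keep only words seen in more than one file.
    Map<String, Set<String>> result = new TreeMap<>();
    for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
      Set<String> files = distinctFiles(entry.getValue().iterator());
      if (files.size() > 1) {
        result.put(entry.getKey(), files);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> files = new TreeMap<>();
    files.put("a.txt", "hello world hello");
    files.put("b.txt", "hello hadoop");
    System.out.println(commonWords(files)); // prints {hello=[a.txt, b.txt]}
  }
}
```

Note that "hello" occurs twice in a.txt but that file is listed only once, which is the dedup step described above.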

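How you get the name of the file being processed depends on whether or not you are using the new API (the mapreduce or mapred package names, respectively). With the new API, I know you can extract the mapper's input split from the Context object using the getInputSplit method, then probably cast the InputSplit to a FileSplit, assuming you are using TextInputFormat. With the old API, I have never tried it, but apparently you can use a configuration property called map.input.file.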

This is also a good case for introducing a combiner, to dedup multiple occurrences of a word coming from the same mapper.
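As a rough illustration of what such a combiner would achieve, the plain-Java sketch below dedups the (word, file name) pairs coming out of a single mapper. `CombinerSketch` and `combine` are hypothetical names and nothing here is a Hadoop API; it only models the data reduction. One point worth keeping in mind, stated as an assumption about how you would wire it up: a combiner only sees part of the data, so it should pass through every distinct (word, file name) pair and leave the "appears in more than one file" filter to the reducer.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-alone sketch; it does not use Hadoop classes.
class CombinerSketch {

  // Input: the (word, fileName) pairs emitted by one mapper, as 2-element arrays.
  // Output: word -> distinct file names, in first-seen order.
  // No filtering happens here; every distinct pair survives.
  static Map<String, Set<String>> combine(List<String[]> mapperOutput) {
    Map<String, Set<String>> combined = new LinkedHashMap<>();
    for (String[] pair : mapperOutput) {
      combined.computeIfAbsent(pair[0], k -> new LinkedHashSet<>()).add(pair[1]);
    }
    return combined;
  }

  public static void main(String[] args) {
    List<String[]> pairs = List.of(
        new String[] {"hello", "a.txt"},
        new String[] {"world", "a.txt"},
        new String[] {"hello", "a.txt"});
    // Three pairs go in, two distinct pairs come out.
    System.out.println(combine(pairs)); // prints {hello=[a.txt], world=[a.txt]}
  }
}
```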

Update

So, in response to your problem: you are trying to use an instance variable called reporter, which does not exist in the class scope of the mapper. Amend it as follows:

public static class Map extends MapReduceBase implements Mapper<Text, Text, Text, Text> {
  // These are the original line; I am not using them but left them here...
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  //My changes are here too
  private Text outvalue=new Text();
  private String filename = null;

  public void map(Text key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    if (filename == null) {
      filename = ((FileSplit) reporter.getInputSplit()).getPath().getName();
    }

    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());

      //    And here        
      outvalue.set(filename);
      output.collect(word, outvalue);
    }
  }
}

(I really don't know why SO isn't respecting the formatting above...)
