hadoop - Sorted word count using Hadoop MapReduce

Question

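I'm very new to MapReduce, and I have completed the Hadoop word count example.

That example produces a file of word counts (key-value pairs) that is not sorted by count. Is it possible to sort it by the number of word occurrences by combining another MapReduce task with the earlier one?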

score 1 · Accepted Answer

In a simple word count MapReduce program, the output you get is sorted by word. Sample output:
Apple 1
Boy 30
Cat 2
Frog 20
Zebra 1
If you want the output sorted by the number of occurrences of each word, i.e. in the format below
1 Apple
1 Zebra
2 Cat
20 Frog
30 Boy
you can write another MR program using the mapper and reducer below. Its input is the output of the simple word count program.

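import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Reads "word count" lines and emits (count, word), so the shuffle sorts by count.
class Map1 extends MapReduceBase implements Mapper<Object, Text, IntWritable, Text>
{
    public void map(Object key, Text value, OutputCollector<IntWritable, Text> collector, Reporter reporter) throws IOException
    {
        StringTokenizer stringTokenizer = new StringTokenizer(value.toString());

        // Skip blank or malformed lines instead of emitting placeholder values.
        if (stringTokenizer.countTokens() < 2)
        {
            return;
        }

        String word = stringTokenizer.nextToken();
        int number = Integer.parseInt(stringTokenizer.nextToken());

        collector.collect(new IntWritable(number), new Text(word));
    }
}


// Writes every word out under its count; the keys arrive already sorted.
class Reduce1 extends MapReduceBase implements Reducer<IntWritable, Text, IntWritable, Text>
{
    public void reduce(IntWritable key, Iterator<Text> values, OutputCollector<IntWritable, Text> collector, Reporter reporter) throws IOException
    {
        while (values.hasNext())
        {
            collector.collect(key, values.next());
        }
    }
}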
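
To check what the second job should produce without starting a cluster, the swap-and-sort step can be mimicked locally. This is only a plain-Java sketch of the idea, not Hadoop code: the class name `SortByCountSketch` and the method `sortByCount` are made up for illustration. Words with equal counts keep their input order here; in a real job the order of values within one key is not guaranteed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortByCountSketch {

    // Turns "word count" lines into "count<TAB>word" lines ordered by count.
    public static List<String> sortByCount(List<String> wordCountLines) {
        List<String[]> pairs = new ArrayList<>();
        for (String line : wordCountLines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length < 2) {
                continue; // skip blank or malformed lines
            }
            pairs.add(parts);
        }

        // List.sort is stable, so words with equal counts keep their input order.
        pairs.sort(Comparator.comparingInt((String[] p) -> Integer.parseInt(p[1])));

        List<String> result = new ArrayList<>();
        for (String[] p : pairs) {
            result.add(p[1] + "\t" + p[0]); // count first, then word
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("Apple 1", "Boy 30", "Cat 2", "Frog 20", "Zebra 1");
        for (String line : sortByCount(input)) {
            System.out.println(line);
        }
    }
}
```

Run on the sample word count output above, it prints the lines in the order Apple, Zebra, Cat, Frog, Boy, each preceded by its count.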

score 0 · Accepted Answer

In Hadoop, sorting is done between the map and reduce phases. One approach to sorting by word occurrence is to use a custom group comparator that doesn't group anything, so every call to reduce receives just a key and one value.

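// Imports shared by the three classes below; each public class goes in its own file.
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class Program {
   public static void main( String[] args) throws IOException {
      JobConf conf = new JobConf( Program.class);
      // Assumption: the word count output is a SequenceFile of (Text, IntWritable)
      // records, and args[0] / args[1] are the input and output paths.
      conf.setInputFormat( SequenceFileInputFormat.class);
      FileInputFormat.setInputPaths( conf, new Path( args[0]));
      FileOutputFormat.setOutputPath( conf, new Path( args[1]));
      conf.setOutputKeyClass( IntWritable.class);
      conf.setOutputValueClass( Text.class);
      conf.setMapperClass( Map.class);
      conf.setReducerClass( IdentityReducer.class);
      conf.setOutputValueGroupingComparator( GroupComparator.class);
      conf.setNumReduceTasks( 1);   // one reducer, so one globally sorted file
      JobClient.runJob( conf);
   }
}

// Swaps (word, count) into (count, word) so the shuffle sorts by count.
public class Map extends MapReduceBase implements Mapper<Text,IntWritable,IntWritable,Text> {
   public void map( Text key, IntWritable value, OutputCollector<IntWritable,Text> output, Reporter reporter) throws IOException {
       output.collect( value, key);
   }
}

public class GroupComparator extends WritableComparator {
    protected GroupComparator() {
        super( IntWritable.class, true);
    }

    // Never reports two keys as equal, so no two records share a reduce call.
    public int compare( WritableComparable w1, WritableComparable w2) {
        return -1;
    }
}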
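
The effect of a grouping comparator can be illustrated outside Hadoop. The sketch below is a simplified plain-Java model, under the assumption that the framework walks the sorted keys and starts a new reduce call whenever the grouping comparator reports that a key differs from the previous one. `GroupingSketch` and `group` are hypothetical names, not part of any Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class GroupingSketch {

    // Splits already-sorted keys into groups. A new group starts whenever the
    // grouping comparator says the current key differs from the previous one.
    public static List<List<Integer>> group(List<Integer> sortedKeys, Comparator<Integer> grouping) {
        List<List<Integer>> groups = new ArrayList<>();
        Integer previous = null;
        for (Integer key : sortedKeys) {
            if (previous == null || grouping.compare(previous, key) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(key);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Integer> counts = Arrays.asList(1, 1, 2, 20, 30);

        // Natural ordering: equal counts share one group.
        System.out.println(group(counts, Comparator.<Integer>naturalOrder()));
        // prints [[1, 1], [2], [20], [30]]

        // Comparator that never returns 0: one group per record.
        System.out.println(group(counts, (a, b) -> -1));
        // prints [[1], [1], [2], [20], [30]]
    }
}
```

With the always `-1` comparator the two records with count 1 land in separate groups, which is the "key and one value per reduce call" behaviour described in this answer.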

score 0 · Accepted Answer

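As you said, one possibility is to write two jobs to do this.

First job: the simple word count example.

Second job: does the sorting part.

The pseudo code could be:

Note: the output file generated by the first job will be the input for the second job.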

    Mapper2(String _key, IntWritable _value) {
        // Just swap _key and _value. The shuffle sorts by the mapper's output key,
        // so the reducer receives the counts in sorted order.
        emit(_value, _key);
    }

    Reduce2(IntWritable valueOfMapper2, Iterable<String> keysOfMapper2) {
        // On the reducer side, all words that have the same count arrive together.
        for each K in keysOfMapper2 {
            emit(K, valueOfMapper2); // Output is in ascending order of count.
        }
    }

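You can also sort in descending order by writing a separate comparator class that does the trick. Include the comparator in the job as follows:

job.setSortComparatorClass(Comparator.class);

This comparator sorts the counts (the second mapper's output keys) in descending order before they are sent to the reducer side. In the reducer you then just emit the values.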
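
The comparison such a comparator has to perform is small. Here is a plain-Java sketch of just that logic, with no Hadoop classes involved; `DescendingCountSketch`, `DESCENDING`, and `sortDescending` are illustrative names. In a Hadoop comparator the same swapped comparison would sit inside its `compare` method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class DescendingCountSketch {

    // Same contract as any compare method, with the operands swapped so that
    // larger counts sort first. Integer.compare avoids subtraction overflow.
    public static final Comparator<Integer> DESCENDING = (a, b) -> Integer.compare(b, a);

    // Returns a sorted copy and leaves the input list untouched.
    public static List<Integer> sortDescending(List<Integer> counts) {
        List<Integer> copy = new ArrayList<>(counts);
        copy.sort(DESCENDING);
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(sortDescending(Arrays.asList(1, 30, 2, 20, 1)));
        // prints [30, 20, 2, 1, 1]
    }
}
```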

score 0 · Accepted Answer

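The output from the Hadoop MapReduce wordcount example is sorted by key, so it should be in alphabetical order.

With Hadoop you can create your own key objects that implement the WritableComparable interface, which lets you override the compareTo method. This allows you to control the sort order.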

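To create output that is sorted by the number of occurrences, you will probably have to add another MapReduce job to process the output of the first one. This second job is very simple and needs no real reduce logic; an identity reducer is enough, although the reduce phase cannot be dropped entirely, because a job with zero reducers skips the shuffle and sort. You just need to implement your own Writable key object to wrap a word and its frequency. A custom writable looks something like this: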

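 import java.io.DataInput;
 import java.io.DataOutput;
 import java.io.IOException;

 import org.apache.hadoop.io.WritableComparable;

 public class MyWritableComparable implements WritableComparable<MyWritableComparable> {
       // Some data
       private int counter;
       private long timestamp;

       // Serialize the fields in a fixed order.
       public void write(DataOutput out) throws IOException {
         out.writeInt(counter);
         out.writeLong(timestamp);
       }

       // Read the fields back in the same order they were written.
       public void readFields(DataInput in) throws IOException {
         counter = in.readInt();
         timestamp = in.readLong();
       }

       // Defines the sort order of the keys: ascending by counter.
       public int compareTo(MyWritableComparable w) {
         int thisValue = this.counter;
         int thatValue = w.counter;
         return (thisValue < thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
       }
     }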

I took this example from here.

You will probably need to override hashCode, equals and toString as well.
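
As a rough illustration of a key that wraps a word and its frequency together with those three overrides, here is a sketch of a hypothetical `WordCountKey`. It is an assumption-laden example rather than code from the Hadoop documentation: it uses only `java.io`, so it compiles without Hadoop on the classpath, and in a real job the class would additionally declare `implements WritableComparable<WordCountKey>`. The `roundTrip` helper exists only to exercise `write` and `readFields`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Objects;

// Hypothetical key holding a word and its count.
public class WordCountKey implements Comparable<WordCountKey> {
    private String word = "";
    private int count;

    public WordCountKey() { } // no-arg constructor, needed when deserializing

    public WordCountKey(String word, int count) {
        this.word = word;
        this.count = count;
    }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeInt(count);
    }

    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
        count = in.readInt();
    }

    // Order by count first, then alphabetically so that ties are deterministic.
    @Override
    public int compareTo(WordCountKey other) {
        int byCount = Integer.compare(count, other.count);
        return byCount != 0 ? byCount : word.compareTo(other.word);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof WordCountKey)) {
            return false;
        }
        WordCountKey other = (WordCountKey) o;
        return count == other.count && word.equals(other.word);
    }

    @Override
    public int hashCode() {
        return Objects.hash(word, count);
    }

    @Override
    public String toString() {
        return count + "\t" + word;
    }

    // Serializes a key to bytes and reads it back into a fresh instance.
    public static WordCountKey roundTrip(WordCountKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            WordCountKey copy = new WordCountKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        WordCountKey frog = new WordCountKey("Frog", 20);
        System.out.println(roundTrip(frog).equals(frog));                    // prints true
        System.out.println(frog.compareTo(new WordCountKey("Boy", 30)) < 0); // prints true
    }
}
```

The overrides matter in practice: the default hash partitioner picks a reducer from the key's `hashCode`, and the text output format writes keys using `toString`.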
