hadoop - Is a custom Partitioner useful in a Map-reduce job if the key already implements hashCode()?

Question

I have created a custom key class without implementing hashCode().

I run a map-reduce job, and in the job configuration I set a partitioner class like this:

        Job job = Job.getInstance(config);
        job.setJarByClass(ReduceSideJoinDriver.class);

        FileInputFormat.addInputPaths(job, filePaths.toString());
        FileOutputFormat.setOutputPath(job, new Path(args[args.length-1]));

        job.setMapperClass(JoiningMapper.class);
        job.setReducerClass(JoiningReducer.class);
        job.setPartitionerClass(TaggedJoiningPartitioner.class); // the partitioner is set here
        job.setGroupingComparatorClass(TaggedJoiningGroupingComparator.class);
        job.setOutputKeyClass(TaggedKey.class);
        job.setOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);

The partitioner implementation is as follows:

public class TaggedJoiningPartitioner extends Partitioner<TaggedKey,Text> {

    @Override
    public int getPartition(TaggedKey taggedKey, Text text, int numPartitions) {
        return Math.abs(taggedKey.getJoinKey().hashCode()) % numPartitions;
    }
}

I run the map-reduce job and save the output.

Next, I comment out job.setPartitionerClass(TaggedJoiningPartitioner.class); in the job configuration above.

I then implemented hashCode() in my custom key class as follows:

public class TaggedKey implements Writable, WritableComparable<TaggedKey> {

    private Text joinKey = new Text();
    private IntWritable tag = new IntWritable();

    @Override
    public int compareTo(TaggedKey taggedKey) {
        int compareValue = this.joinKey.compareTo(taggedKey.getJoinKey());
        if(compareValue == 0 ){
            compareValue = this.tag.compareTo(taggedKey.getTag());
        }
       return compareValue;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        joinKey.write(out);
        tag.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        joinKey.readFields(in);
        tag.readFields(in);
    }

    @Override
    public int hashCode(){
        return joinKey.hashCode();
    }

    @Override
    public boolean equals(Object o){
        if (this==o)
            return true;
        if (!(o instanceof TaggedKey)){
            return false;
        }
        TaggedKey that=(TaggedKey)o;
        return this.joinKey.equals(that.joinKey);
    }
}

Now I run the job again (note: no partitioner is set). After the map-reduce job finishes, I compare its output with the previous output. The two are exactly the same.

So my questions are:

    1) Is this behavior universal, that is, always reproducible with any
    custom implementation?

    2) Is implementing hashCode() on my key class the same as calling
    job.setPartitionerClass?

    3) If they both serve the same purpose, what is the need for
    setPartitionerClass?

    4) If the hashCode() implementation and the Partitioner class
    implementation conflict, which one takes precedence?

score 0 · Accepted Answer

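You get the same result because your custom partitioner does essentially the same thing as the default partitioner: both derive the partition from the join key's hash code modulo the number of partitions. You have only moved that code into a separate class and run it there. If you put in different logic, such as key.toString().length() % numPartitions, or anything other than taking hashCode() % numPartitions, you will see a different distribution of keys across the reducers.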
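To make the comparison concrete, here is a minimal plain-Java sketch with no Hadoop dependency. `defaultPartition` uses the expression found in Hadoop's default `HashPartitioner`, `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`, and `customPartition` uses the question's `Math.abs(...) % numPartitions`. The class name `PartitionFormulas` and the sample hash values are invented for illustration. Both formulas depend only on the hash, so records with equal join keys always land on the same reducer, but the partition index itself can differ when the hash is negative.

```java
// Plain-Java sketch; no Hadoop classes are used.
// PartitionFormulas is a hypothetical name chosen for this example.
final class PartitionFormulas {

    // Same expression as Hadoop's default HashPartitioner:
    // clear the sign bit, then take the remainder.
    static int defaultPartition(int hash, int numPartitions) {
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    // Same expression as the question's TaggedJoiningPartitioner.
    // Math.abs(Integer.MIN_VALUE) is still negative, so this can
    // return a negative (invalid) partition index for that one hash.
    static int customPartition(int hash, int numPartitions) {
        return Math.abs(hash) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 3;
        int[] sampleHashes = {7, 42, -5, Integer.MIN_VALUE};
        for (int h : sampleHashes) {
            System.out.println(h + " -> default=" + defaultPartition(h, numPartitions)
                    + ", custom=" + customPartition(h, numPartitions));
        }
    }
}
```

For non-negative hashes the two formulas agree, which is consistent with the identical job output reported in the question. If a custom partitioner is still wanted, the sign-bit mask is the safer of the two expressions.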

For example, you cannot get this partitioner just by editing hashCode():

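import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public static class MyPartitioner extends Partitioner<Text, Text> {

    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {

        // Length of the key in bytes
        int len = key.getLength();

        // No reducers configured: everything goes to partition 0
        if (numReduceTasks == 0)
            return 0;

        // Short keys go to partition 0
        if (len <= numReduceTasks / 3) {
            return 0;
        }
        // Medium-length keys go to partition 1 (or 0 if there is only one reducer)
        if (len > numReduceTasks / 3 && len <= numReduceTasks / 2) {
            return 1 % numReduceTasks;
        }
        // All remaining keys are spread by length
        else
            return len % numReduceTasks;
    }
}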
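The framework asks whichever partitioner is configured for the partition number, and hashCode() only matters if that partitioner chooses to call it. The following plain-Java sketch illustrates this with `String` keys instead of Hadoop's `Text`. The class name `LengthVersusHash`, the method names, and the sample keys are hypothetical and not taken from the answer.

```java
import java.nio.charset.StandardCharsets;

// Plain-Java sketch; no Hadoop classes are used.
// LengthVersusHash is a hypothetical name chosen for this example.
final class LengthVersusHash {

    // Hash-based choice, using the same expression as Hadoop's default HashPartitioner.
    static int hashPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Length-based choice: never consults hashCode(), only the byte length
    // of the key and the number of partitions.
    static int lengthPartition(String key, int numPartitions) {
        int len = key.getBytes(StandardCharsets.UTF_8).length;
        return len % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        for (String key : new String[] {"ab", "ba", "abc"}) {
            System.out.println(key + " -> hash=" + hashPartition(key, numPartitions)
                    + ", length=" + lengthPartition(key, numPartitions));
        }
    }
}
```

Here "ab" and "ba" have different hash codes and are separated by the hash-based rule, yet the length-based rule sends both to the same partition. A partitioner also receives the value and the number of reduce tasks, neither of which hashCode() can see, which is why setPartitionerClass exists as a separate hook.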
