java - Handling an uneven distribution of values across keys in Hadoop MapReduce

Question

I am processing input log files in Hadoop whose keys are not evenly distributed. As a result, the values are spread unevenly across the reducers. For example, key1 has 1 value while key2 has 1000 values.

Is there a way to load-balance the values associated with the same key? [I also do not want to change my keys.]

score 0 · Accepted Answer

Perhaps you could use a combiner before the data hits the reducers? This is fairly speculative...

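The idea is to split each key's group of values into partitions of a preset maximum size, and to emit those split k/v pairs to the reducers. This code assumes that the size is set somewhere in your configuration.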

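import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Note: Hadoop may run a combiner zero, one, or several times per key,
// so this split is best-effort rather than guaranteed.
public static class myCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {

        // Read the maximum group size once; "yourMaxSize" must be set in the job configuration.
        int maxSize = Integer.parseInt(context.getConfiguration().get("yourMaxSize"));
        int part = 0;   // index of the current sub-key
        int count = 0;  // values written under the current sub-key
        Text outKey = new Text(key.toString() + "_" + part);

        // Write each value straight through, so nothing is buffered and
        // Hadoop's reuse of the Text instance between iterations is harmless.
        for (Text value : values) {
            if (count == maxSize) {
                // essentially partitioning each key: start the next sub-key
                part += 1;
                count = 0;
                outKey.set(key.toString() + "_" + part);
            }
            context.write(outKey, value);
            count += 1;
        }
    }
}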
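
To make the splitting step easier to follow outside a cluster, here is a minimal plain-Java sketch of the same idea with no Hadoop dependency. The class `KeySplitSketch` and the helpers `split` and `originalKey` are hypothetical names made up for this illustration, and the maximum size of 300 is an arbitrary example value. It also shows one possible way to recover the original key afterwards by stripping the `_<part>` suffix, assuming a later step is allowed to do that.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class KeySplitSketch {

    // Splits the values of one key into sub-keys "key_0", "key_1", ...
    // so that each sub-key holds at most maxSize values.
    static Map<String, List<String>> split(String key, List<String> values, int maxSize) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (int i = 0; i < values.size(); i++) {
            String subKey = key + "_" + (i / maxSize);
            out.computeIfAbsent(subKey, k -> new ArrayList<>()).add(values.get(i));
        }
        return out;
    }

    // Recovers the original key by dropping the "_<part>" suffix added by split().
    static String originalKey(String subKey) {
        return subKey.substring(0, subKey.lastIndexOf('_'));
    }

    public static void main(String[] args) {
        // Mimic the skewed key from the question: key2 with 1000 values.
        List<String> values = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            values.add("v" + i);
        }
        Map<String, List<String>> parts = split("key2", values, 300);
        for (Map.Entry<String, List<String>> e : parts.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().size()
                    + " values (original key: " + originalKey(e.getKey()) + ")");
        }
    }
}
```

With these inputs the 1000 values of key2 end up under four sub-keys (three of 300 values and one of 100). The sub-keys can then hash to different reducers, while the original key stays recoverable from the suffix.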
