java - Hadoop MR hold array reference in reduce method

Question

I would like to have an ArrayList that holds references to the objects inside the reduce function.

@Override
public void reduce( final Text pKey,
                    final Iterable<BSONWritable> pValues,
                    final Context pContext )
        throws IOException, InterruptedException{
    final ArrayList<BSONWritable> bsonObjects = new ArrayList<BSONWritable>();

    for ( final BSONWritable value : pValues ){
        bsonObjects.add(value);
        //do some calculations.
    }
    for ( final BSONWritable value : bsonObjects ){
        //do something else.
    }
}

The problem is that bsonObjects.size() returns the correct number of elements, but all the elements of the list are equal to the last inserted element. For example, if the

{id:1}

{id:2}

{id:3}

elements are inserted, bsonObjects will hold 3 items, but all of them will be {id:3}. Is there a problem with this approach? Any idea why this happens? I have tried changing the List to a Map, but then only one element was added to the map. I have also tried making the declaration of bsonObjects global, but the same behavior happens.

score 2 · Accepted Answer

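This is documented behavior. The pValues iterator reuses a single BSONWritable instance, so each time its contents change inside the loop, every reference stored in the bsonObjects ArrayList sees the new contents as well. When you call add() on bsonObjects, you are storing a reference, not a copy. This approach lets Hadoop save memory.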

In the first loop you need to instantiate a new BSONWritable whose contents equal those of value (a deep copy), and then add that new object to bsonObjects.
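The effect can be reproduced without Hadoop. The following is a minimal plain-Java sketch, not Hadoop code: `Holder` and `reusingValues` are hypothetical stand-ins for BSONWritable and for the reducer's value iterator, assumed here only to mimic the instance reuse described above.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class ReuseDemo {
    // Hypothetical stand-in for a mutable value object such as BSONWritable.
    static final class Holder {
        int id;
        Holder() {}
        Holder(Holder other) { this.id = other.id; } // copy constructor
    }

    // Iterable that hands out the SAME Holder instance on every step and only
    // overwrites its contents, mimicking the reuse described in the answer.
    static Iterable<Holder> reusingValues(final int... ids) {
        return () -> new Iterator<Holder>() {
            private final Holder shared = new Holder();
            private int next = 0;

            @Override public boolean hasNext() { return next < ids.length; }

            @Override public Holder next() {
                if (!hasNext()) throw new NoSuchElementException();
                shared.id = ids[next++];
                return shared;
            }
        };
    }

    // Stores the reference handed out by the iterator (the situation in the question).
    static List<Integer> idsWhenStoringReferences(int... ids) {
        List<Holder> kept = new ArrayList<>();
        for (Holder h : reusingValues(ids)) {
            kept.add(h);
        }
        return toIds(kept);
    }

    // Stores a fresh copy of each element instead of the shared reference.
    static List<Integer> idsWhenStoringCopies(int... ids) {
        List<Holder> kept = new ArrayList<>();
        for (Holder h : reusingValues(ids)) {
            kept.add(new Holder(h));
        }
        return toIds(kept);
    }

    private static List<Integer> toIds(List<Holder> holders) {
        List<Integer> out = new ArrayList<>();
        for (Holder h : holders) {
            out.add(h.id);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(idsWhenStoringReferences(1, 2, 3)); // [3, 3, 3]
        System.out.println(idsWhenStoringCopies(1, 2, 3));     // [1, 2, 3]
    }
}
```

The list that stores references ends up with three entries pointing at one object, which is why size() is correct while every element shows the last value.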

Try this:

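// Requires: import org.apache.hadoop.io.WritableUtils;
for ( final BSONWritable value : pValues ){
    // A plain assignment would only copy the reference, so make an
    // independent copy of the reused instance before storing it.
    final BSONWritable copy = WritableUtils.clone( value, pContext.getConfiguration() );
    bsonObjects.add(copy);
    //do some calculations.
}
for ( final BSONWritable value : bsonObjects ){
    // Each element is now a distinct object.
    //do something else.
}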

The second loop can then iterate over bsonObjects and see each distinct value.

Be careful, though: if you make deep copies, all the values for this reducer's key have to fit in memory.
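If the second pass only needs one field of each value, one way to reduce that memory cost is to copy out just that field while iterating. The sketch below is plain Java under stated assumptions: `Holder`, its `amount` field, and the "count values above the mean" calculation are hypothetical and are not taken from the question.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

class CompactBufferDemo {
    // Hypothetical stand-in for a reused, mutable value object.
    static final class Holder {
        long amount;
    }

    // Iterable that reuses one Holder instance, as a reducer's values would.
    static Iterable<Holder> reusingValues(final long... amounts) {
        return () -> new Iterator<Holder>() {
            private final Holder shared = new Holder();
            private int next = 0;

            @Override public boolean hasNext() { return next < amounts.length; }

            @Override public Holder next() {
                if (!hasNext()) throw new NoSuchElementException();
                shared.amount = amounts[next++];
                return shared;
            }
        };
    }

    // First pass copies only the primitive field into a growable array;
    // second pass works on that array instead of on whole objects.
    static int countAboveMean(Iterable<Holder> values) {
        long[] kept = new long[16];
        int n = 0;
        long sum = 0;
        for (Holder h : values) {
            if (n == kept.length) {
                kept = Arrays.copyOf(kept, n * 2);
            }
            kept[n++] = h.amount; // a primitive is copied by value, so reuse is harmless
            sum += h.amount;
        }
        if (n == 0) {
            return 0;
        }
        double mean = (double) sum / n;
        int above = 0;
        for (int i = 0; i < n; i++) {
            if (kept[i] > mean) {
                above++;
            }
        }
        return above;
    }

    public static void main(String[] args) {
        System.out.println(countAboveMean(reusingValues(1, 2, 3, 10))); // 1
    }
}
```

This still grows with the number of values per key, but it holds one long per value rather than a full copied document.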
