python - Cross Product in Map Reduce using Hadoop Streaming and Python

Question

I am learning Python and Hadoop. I completed the setup and the basic examples provided on the official site using Python and Hadoop Streaming. I then tried implementing a join of two files. I completed an equi-join, which checks whether the same key appears in both input files and, if so, outputs the key along with the values from file 1 and file 2, in that order. The equi-join works as expected.

Now I want to do an inequality join, which involves computing the cross product before applying the inequality condition. I am using the same mapper (do I need to change it?) and I changed the reducer so that it contains a nested loop, since every key-value pair in file 1 must be matched with all key-value pairs in file 2. This doesn't work, because you can only go through the stream once. I thought of storing "some" values in the reducer and comparing them, but I have no idea how many. The naive method is to store the whole content of file 2 in an array (or a similar structure), but that is wasteful and goes against the idea of distributed processing. So my questions are:

How can I store values in the reducer so that I can compute the cross product between the two files?
In the equi-join, Hadoop seems to send all key-value pairs with the same key to the same reducer, which is perfectly fine and works well for that case. However, how do I change this behaviour (if needed) so that the required grouping of key-value pairs goes to the correct reducer?

Sample Files: http://pastebin.com/ufYydiPu

Python Map/Reduce Scripts: http://pastebin.com/kEJwd2u1

Hadoop Command I am using:

bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -file /home/hduser/mapper.py -mapper mapper.py -file /home/hduser/ireducer.py -reducer reducer.py -input /user/hduser/inputfiles/* -output /user/hduser/join-output

Any help/hint is much appreciated.

score 3 · Accepted Answer

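One way to deal with multiple combinations, and one that is very helpful for avoiding nested loops, is to use the itertools module. Specifically, the itertools.product function takes care of the Cartesian product and returns it as an iterator, yielding the combinations one at a time. This is good for memory usage and efficiency, and it can simplify your code significantly if you have to join multiple datasets in one map-reduce job.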

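Regarding the correspondence between the data emitted by the mapper and the datasets to be combined in the reducer: if the dataset for each key is not too big, you can simply emit records like the following from the mapper: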

{key, [origin_1, values]}
{key, [origin_2, values]}

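In the reducer you can then group the values that share the same origin into a dictionary. The entries of that dictionary are the datasets to which the Cartesian product is applied using itertools.product.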
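The reducer side of this idea could be sketched roughly as follows. This is only an illustration under some assumptions that are not in the original post: the mapper is assumed to emit tab-separated lines of the form key, origin, value, the origin tags are assumed to be the literal strings origin_1 and origin_2, and the names parse and cross_join are made up for the example. It relies on the streaming framework delivering lines sorted by key, and it buffers all values for one key in memory, which is why the "not too big" condition matters.

```python
from collections import defaultdict
from itertools import groupby, product


def parse(line):
    """Split one 'key<TAB>origin<TAB>value' line into its three fields."""
    key, origin, value = line.rstrip("\n").split("\t", 2)
    return key, origin, value


def cross_join(lines, left="origin_1", right="origin_2"):
    """Yield (key, left_value, right_value) for every pairing within a key.

    Assumes the lines arrive sorted by key, so that all records for one
    key are adjacent and groupby can collect them in a single pass.
    """
    records = (parse(line) for line in lines if line.strip())
    for key, group in groupby(records, key=lambda rec: rec[0]):
        # Group this key's values by the file they came from.
        by_origin = defaultdict(list)
        for _, origin, value in group:
            by_origin[origin].append(value)
        # Cartesian product of the two per-origin lists, yielded lazily.
        for a, b in product(by_origin[left], by_origin[right]):
            yield key, a, b


# Hypothetical sample input standing in for the reducer's stdin.
sample = [
    "k1\torigin_1\ta\n",
    "k1\torigin_1\tb\n",
    "k1\torigin_2\tx\n",
    "k2\torigin_1\tc\n",
]
for key, a, b in cross_join(sample):
    print("\t".join((key, a, b)))
```

In an actual streaming reducer, sys.stdin would be passed to cross_join in place of the sample list. An inequality condition could be applied as a filter inside the final loop, before yielding each pair.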
