hadoop - Pig approach to pairing data fields in a data set

Question

I'm new to Pig and trying to correctly implement a somewhat common algorithm in which I need to pair every matching record in a set of records. In order to distill the question into its simplest form and also avoid discussing some business-specific sensitivities, here's a mock problem:

Say that I have a dataset representing college classes and students that attend them:

Philosophy,John
English,Mary
English,Sue
History,Jack
Philosophy,David
English,Mark
English,Larry

I want to pair every association between students that took the same class; so the output would include this, showing the explosion of the four 'English' rows into six associations:

Philosophy   John,David
English    Mary,Sue
English    Mary,Mark
English    Mary,Larry
English    Sue,Mark
English    Sue,Larry
English    Mark,Larry

This page: http://ofps.oreilly.com/titles/9781449302641/advanced_pig_latin.html refers to using flatten() to effect the cross product. I have tried several approaches and researched this extensively and would post my attempts but honestly I'm flailing and I think that would just confuse the reader and not provide any value. But here's the boilerplate:

s = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
grp = group s by class;
...

(I believe the problem I'm facing has to do with flatten requiring multiple bags, not multiple fields, and I can't figure out how to get my group'ing to generate multiple bags...)

Thank you for any assistance!

score 6 · Accepted Answer

You can use the UnorderedPairs UDF from LinkedIn's DataFu project. Download the package and run the following (tested on Pig v0.10.0):

register '/home/user/datafu/dist/datafu-0.0.4.jar'
define UnorderedPairs datafu.pig.bags.UnorderedPairs();
A = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
B = GROUP A BY class;
C = FOREACH B GENERATE group, FLATTEN(UnorderedPairs(A.student));

When further flattening the result:

D = FOREACH C generate FLATTEN($0) as (class:chararray), 
      FLATTEN($1) as (student1:chararray), FLATTEN($2) as (student2:chararray);

You'll end up having the desired result:

dump D;

(English,Mary,Sue)
(English,Mary,Mark)
(English,Mary,Larry)
(English,Sue,Mark)
(English,Sue,Larry)
(English,Mark,Larry)
(Philosophy,John,David)
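
For checking a result like this against a small sample outside Pig, the same grouping and pairing can be sketched in plain Python. This is only an illustration of the expected output, not of how the DataFu UDF is implemented; the function name `unordered_pairs_by_class` is made up for this sketch, and the rows are the sample data from the question.

```python
from collections import defaultdict
from itertools import combinations

# Sample rows from the question, as (class, student) tuples.
rows = [
    ("Philosophy", "John"),
    ("English", "Mary"),
    ("English", "Sue"),
    ("History", "Jack"),
    ("Philosophy", "David"),
    ("English", "Mark"),
    ("English", "Larry"),
]


def unordered_pairs_by_class(rows):
    """Group students by class, then emit every 2-student combination per class.

    Hypothetical helper for illustration only; it mirrors GROUP followed by
    an unordered-pairs step, not any specific Pig or DataFu internals.
    """
    students_by_class = defaultdict(list)
    for class_name, student in rows:
        students_by_class[class_name].append(student)

    result = []
    for class_name, students in students_by_class.items():
        # combinations() yields each pair once, in input order, never (x, x).
        for first, second in combinations(students, 2):
            result.append((class_name, first, second))
    return result


pairs = unordered_pairs_by_class(rows)
for row in pairs:
    print(row)
```

A class with a single student (History here) contributes no rows, which matches the output above.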

score 1 · Accepted Answer

There are two approaches I see to this. I have not tried either in quite some time, so please follow up and let us know if they worked well or not.

The first approach is a self-join:

s1 = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
s2 = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
b = JOIN s1 BY class, s2 BY class;
...

The downside of this is that you have to load the data twice. There is some discussion on why this sucks, but it's just how you have to do it.

The other option would be to use CROSS nested in a FOREACH after the GROUP:

Note: I'm not sure at all if this will work, or if I got the syntax right (I'm not in an environment that I could test this right now). Perhaps someone can confirm.

B = GROUP s BY class;
C = FOREACH B {                          
   DA = CROSS s, s;                       
   GENERATE FLATTEN(DA);
}

score 1 · Accepted Answer

This can be done with a self-join and some simple filtering.

classes1 = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
classes2 = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
joined = JOIN classes1 BY class, classes2 BY class;
filtered = FILTER joined BY classes1::student < classes2::student;
pairs = FOREACH filtered GENERATE classes1::student AS student1, classes2::student AS student2;

Note that filtering by student1 < student2 gets you unique pairs.
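
To see why the strict comparison leaves each pair exactly once, here is a rough sketch in Python of what the join and filter do for a single class. It is an illustration of the logic only, not of Pig's execution; the list `english` is just the four English students from the question.

```python
# The four students in the 'English' class from the question.
english = ["Mary", "Sue", "Mark", "Larry"]

# Self-join on class: every student is matched with every student in the
# same class, including itself, so 4 students give 4 * 4 = 16 rows.
joined = [(a, b) for a in english for b in english]

# Strict string comparison drops the (x, x) rows and keeps only one of
# (x, y) / (y, x), leaving 4 * 3 / 2 = 6 rows.
filtered = [(a, b) for a, b in joined if a < b]

print(len(joined), len(filtered))
for pair in filtered:
    print(pair)
```

Two things follow from the comparison: within each row the names come out in sort order (for example Mark before Mary), and two students with the same name in the same class would not be paired with each other.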
