これを最適化することについて心配する必要はありません。フィールドの名前を変更すると、わずかなオーバーヘッドが発生する可能性がありますが、追加の Map/Reduce ジョブがトリガーされることはありません。フィールドの投影は、 の後にレデューサーで発生しますJOIN
。
以下に示す2 つのコードと Map Reduce プランを検討してくださいexplain
。
名前を変更せずに
A = load 'first' using PigStorage() as (f1, f2, id);
B = load 'second' using PigStorage() as (g1, g2, id);
C = join A by id, B by id;
store C into 'output';
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-30
Map Plan
Union[tuple] - scope-31
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-20
| | |
| | Project[bytearray][2] - scope-21
| |
| |---A: New For Each(false,false,false)[bag] - scope-7
| | |
| | Project[bytearray][0] - scope-1
| | |
| | Project[bytearray][1] - scope-3
| | |
| | Project[bytearray][2] - scope-5
| |
| |---A: Load(hdfs://location/first:PigStorage) - scope-0
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-22
| |
| Project[bytearray][2] - scope-23
|
|---B: New For Each(false,false,false)[bag] - scope-15
| |
| Project[bytearray][0] - scope-9
| |
| Project[bytearray][1] - scope-11
| |
| Project[bytearray][2] - scope-13
|
|---B: Load(hdfs://location/second:PigStorage) - scope-8--------
Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-27
|
|---POJoinPackage(true,true)[tuple] - scope-32--------
Global sort: false
----------------
名前の変更あり
A = load 'first' using PigStorage() as (f1, f2, id);
B = load 'second' using PigStorage() as (g1, g2, id);
C = join A by id, B by id;
C = foreach C generate A::f1 as f1, -- This
A::f2 as f2, -- section
B::id as id, -- is
B::g1 as g1, -- different
B::g2 as g2; --
store C into 'output';
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-41
Map Plan
Union[tuple] - scope-42
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-20
| | |
| | Project[bytearray][2] - scope-21
| |
| |---A: New For Each(false,false,false)[bag] - scope-7
| | |
| | Project[bytearray][0] - scope-1
| | |
| | Project[bytearray][1] - scope-3
| | |
| | Project[bytearray][2] - scope-5
| |
| |---A: Load(hdfs://location/first:PigStorage) - scope-0
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-22
| |
| Project[bytearray][2] - scope-23
|
|---B: New For Each(false,false,false)[bag] - scope-15
| |
| Project[bytearray][0] - scope-9
| |
| Project[bytearray][1] - scope-11
| |
| Project[bytearray][2] - scope-13
|
|---B: Load(hdfs://location/second:PigStorage) - scope-8--------
Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-38
|
|---C: New For Each(false,false,false,false,false)[bag] - scope-37
| |
| Project[bytearray][0] - scope-27
| |
| Project[bytearray][1] - scope-29
| |
| Project[bytearray][5] - scope-31
| |
| Project[bytearray][3] - scope-33
| |
| Project[bytearray][4] - scope-35
|
|---POJoinPackage(true,true)[tuple] - scope-43--------
Global sort: false
----------------
違いはReduceプランにあります。名前を変更しない場合:
Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-27
|
|---POJoinPackage(true,true)[tuple] - scope-32--------
Global sort: false
対名前の変更:
Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-38
|
|---C: New For Each(false,false,false,false,false)[bag] - scope-37
| |
| Project[bytearray][0] - scope-27
| |
| Project[bytearray][1] - scope-29
| |
| Project[bytearray][5] - scope-31
| |
| Project[bytearray][3] - scope-33
| |
| Project[bytearray][4] - scope-35
|
|---POJoinPackage(true,true)[tuple] - scope-43--------
Global sort: false
つまり、名前の変更について心配する前に、スクリプトで最適化できることが他にもあります。とにかくすべてのレコードをjoin
.