hadoop - Count and group by in Hive

Question

I have a table in Hive like the following:

table1

Cola   | Colb  |  Colc |  Cold  |
---------------------------------
...etc
efo18   691 123 5692                                 
efo18   691 345 5657
...etc
fsx31   950 291 23456                                                         
fsx31   950 404 23456                                                          
fsx31   950 343 23456                                                         
fsx31   950 182 23456                                                         
fsx31   950 120 45042                                                         
fsx31   950 161 23456  
....etc
klz57   490 121 3330                                                          
klz57   490 113 3330                                                          
klz57   490 308 3330                                                          
klz57   490 411 3330                                                           
klz57   490 161 3330                                                          
klz57   386 108 3330                                                          
klz57   490 113 3330                                                          
klz57   490 125 3330                                                          
klz57   490 165 3330                                                          
klz57   490 166 3330  
...etc
---------------------------------

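From the data in table1 I need another table in which rows with the same Cold value form a group, rows within that group with the same Colb value form a subgroup, and rows within that subgroup with the same Cola value are grouped together. In other words, each unique Cola,Colb,Cold combination becomes one row, and the repeated rows are counted.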

insert into table table2(Col1 string,Col2 string,Col3 string,Count int) select cola,colb,cold,count(*) from table1 group by cold,colb,cola;

I expected this:

Col1   | Col2  |  Col3     |  Count  |
-------------------------------------
efo18    691     5692         1
efo18    691     5657         1
fsx31    950     23456        5   <-----1
fsx31    950     45042        1   <-----1
klz57    490     1234         9   <-----2
klz57    386     1234         1   <-----2
--------------------------------------

But I got this:

table2

Col1   | Col2  |  Col3     |  Count  |
-------------------------------------
efo18    691     5692         1
efo18    691     5657         1
fsx31    950     23456        4   <-----1
fsx31    950     25456        1   <-----1
fsx31    950     45042        1   <-----1
klz57    490     1234         8   <-----2
klz57    386     1234         1   <-----2
klz57    490     1234         1   <-----2
--------------------------------------

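What I don't understand is this: I am grouping first by Cold, then by Colb, then by Cola. So why, in the rows marked (<-----1), do rows with the same Cola value end up in different rows of the result? Shouldn't they be in the same group? Colc is not used in the grouping, so how are these rows different? Similarly, for the rows marked (<-----2), what is the issue with the Count here?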

Update:

Binary01, I tried the example you gave:

hive> select * from xyz;
OK
x        y       z      zz
xxx     111     222     123 NULL    NULL    NULL
xxx     111     222     123 NULL    NULL    NULL
xxx     101     222     123 NULL    NULL    NULL
xux     111     422     123 NULL    NULL    NULL
xxx     111     522     323 NULL    NULL    NULL
xyx     111     622     123 NULL    NULL    NULL
xxx     115     322     123 NULL    NULL    NULL
xxx     111     122     123 NULL    NULL    NULL
xxx     111     223     123 NULL    NULL    NULL
xxy     111     212     143 NULL    NULL    NULL
xxx     117     222     123 NULL    NULL    NULL

What are those NULL values doing there? I copied and pasted your example line by line. This happens even though I created the table as:

create table xyz(x string ,y string, z string , zz string) 
row format delimited fields terminated by ',';

And the final query gives:

hive> select * from xyztemp;
OK
xux     111     422     123 NULL    NULL    1
xxx     101     222     123 NULL    NULL    1
xxx     111     122     123 NULL    NULL    1
xxx     111     222     123 NULL    NULL    2
xxx     111     223     123 NULL    NULL    1
xxx     111     522     323 NULL    NULL    1
xxx     115     322     123 NULL    NULL    1
xxx     117     222     123 NULL    NULL    1
xxy     111     212     143 NULL    NULL    1
xyx     111     622     123 NULL    NULL    1

score 4 · Accepted Answer

You must have missed something. I tried the following data, which is similar to your table, and the output is exactly as expected. Please check:

hive> set hive.cli.print.header=true;
hive> create table xyz(x string, y string, z string, zz string) row format delimited fields terminated by ',';
hive> load data local inpath '/home/brdev/sudeep/testdata.txt' into table xyz;
hive> select * from xyz;
OK
x       y       z       zz
xxx     111     222     123
xxx     111     222     123
xxx     101     222     123
xux     111     422     123
xxx     111     522     323
xyx     111     622     123
xxx     115     322     123
xxx     111     122     123
xxx     111     223     123
xxy     111     212     143
xxx     117     222     123

hive> create table xyztemp (aa string, bb string, cc string, dd int);
hive> insert into table xyztemp select x,y,zz,count(*) from xyz group by zz,y,x;
hive> select * from xyztemp;
OK
aa      bb      cc      dd
xxx     101     123     1
xux     111     123     1
xxx     111     123     4
xyx     111     123     1
xxx     115     123     1
xxx     117     123     1
xxy     111     143     1
xxx     111     323     1

I believe the above is the expected output you are looking for.
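
Applied to the table in the question, the same pattern might look like the sketch below. It assumes table1 has the four columns Cola, Colb, Colc and Cold shown in the question and that table2 does not exist yet. The column types and the name cnt are guesses made for illustration. The target table is created first and then filled with a plain INSERT ... SELECT, as in the xyztemp example. The last query is only an optional check, based on the unconfirmed assumption that stray whitespace in the stored values could make rows that look identical fall into separate groups.

```sql
-- Assumption: table2 does not exist yet; column names and types are illustrative.
CREATE TABLE table2 (col1 STRING, col2 STRING, col3 STRING, cnt BIGINT);

-- One output row per distinct (cola, colb, cold) combination.
-- The order of the columns in GROUP BY does not change which rows are grouped together.
INSERT INTO TABLE table2
SELECT cola, colb, cold, count(*)
FROM table1
GROUP BY cola, colb, cold;

-- Optional check (assumes cold is a string column): if values that look the
-- same still land in separate groups, differing lengths hint at hidden characters.
SELECT cola, colb, cold, length(cold) AS cold_len, count(*) AS cnt
FROM table1
GROUP BY cola, colb, cold, length(cold);
```

If the extra NULL columns from the update also show up, running `describe xyz;` shows the schema Hive is actually using for the table, which may differ from the create statement if a table with that name already existed.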
