sql - How to use min() in where/having clause (to avoid subquery) in Hive/SQL

Question

I have a large table of events. Per user, I want to count the occurrences of type A events before the earliest type B event.

I am searching for an elegant query. Hive is used, so I can't do subqueries.

Timestamp Type User 
...        A    X
...        A    X
...        B    X
...        A    X
...        A    X

...        A    Y
...        A    Y
...        A    Y
...        B    Y
...        A    Y

Wanted Result:

User Count_Type_A 
X    2
Y    3

I can get the "cut-off" timestamp by doing:

Select User, min(Timestamp) 
Where Type=B 
Group BY User;

But then how can I use that information inside the next query where I want to do something like:

SELECT User, count(Timestamp) 
WHERE Type=A AND Timestamp<min(User.Timestamp_Type_B) 
GROUP BY User;

My only idea so far is to determine the cut-off timestamps first, then join them with all type A events, and then select from the resulting table, but that feels wrong and would look ugly.

I'm also considering the possibility that this is the wrong type of problem/analysis for Hive and that I should consider hand-written MapReduce or Pig instead.

Please help me by pointing in the right direction.

score 1 · Accepted Answer

In general, I +1 coge.soft's solution. Here it is again for reference:

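-- sub holds each user's earliest type B timestamp.
-- Identifiers are quoted with backticks (Hive's quoting); outer columns are
-- qualified with main because both sides of the join expose `User`.
SELECT main.`User`, COUNT(main.`Timestamp`) AS `Before_First_B_Count`
FROM `Dataset` main
JOIN (SELECT `User`, MIN(`Timestamp`) AS `First_B_TS` FROM `Dataset`
    WHERE `Type` = 'B'
    GROUP BY `User`) sub
        ON (sub.`User` = main.`User`) AND (main.`Timestamp` < sub.`First_B_TS`)
WHERE main.`Type` = 'A'
GROUP BY main.`User`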

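However, note the following:

What happens if a user has no B events? In that case the inner join in the solution won't work, because the sub table has no entry for that user. To handle this, you need to change it to a left outer join (the timestamp comparison then also has to account for First_B_TS being NULL for those users).
This solution also makes two passes over the data: one to populate the sub table and one to join the sub table to the main table. Depending on your notion of performance and efficiency, there is an alternative that does this in a single pass over the data. You can use Hive's DISTRIBUTE BY to distribute the data by user, and Hive's TRANSFORM feature to write a custom reducer that does the count calculation in your favorite language.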
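
The single-pass idea in the second point could look roughly like the Python reducer below. It is only a sketch under assumptions that are not in the original: the function names are made up for illustration, Hive is assumed to hand the script tab-separated user, timestamp, type rows on stdin, and the rows are assumed to arrive grouped by user and ordered by timestamp within each user (for example through DISTRIBUTE BY on the user column plus SORT BY on user and timestamp in the query that feeds TRANSFORM). A user with no B event gets a count of all their A events, which is one possible reading of the first point.

```python
import sys
from itertools import groupby


def count_a_before_first_b(lines):
    """Yield (user, count) from tab-separated user, timestamp, type rows.

    Assumes the rows are grouped by user and sorted by timestamp per user.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for user, events in groupby(rows, key=lambda row: row[0]):
        count = 0
        for _user, _timestamp, event_type in events:
            if event_type == "B":
                # First B reached: the remaining rows of this user are skipped
                # when groupby advances to the next user.
                break
            if event_type == "A":
                count += 1
        yield user, count


def main():
    # Streaming entry point: rows in on stdin, "user<TAB>count" out on stdout.
    for user, count in count_a_before_first_b(sys.stdin):
        sys.stdout.write(f"{user}\t{count}\n")
```

To run this as the streaming script, call main() at the bottom of the file and reference the script in the USING clause of the TRANSFORM query. Because each user's rows are read once and in order, no join against a separately computed cut-off table is needed.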

score 1 · Accepted Answer

First update:

In response to Cilvic's first comment on this answer, I adjusted the query as follows, based on the workaround suggested in the comments at https://issues.apache.org/jira/browse/HIVE-556:

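-- Same logic as the original query, with the join conditions moved from ON into WHERE.
-- Identifiers are quoted with backticks; outer columns are qualified with main.
SELECT main.`User`, COUNT(main.`Timestamp`) AS `Before_First_B_Count`
FROM `Dataset` main
CROSS JOIN (SELECT `User`, MIN(`Timestamp`) AS `First_B_TS` FROM `Dataset`
    WHERE `Type` = 'B'
    GROUP BY `User`) sub
WHERE main.`Type` = 'A'
AND (sub.`User` = main.`User`)
AND (main.`Timestamp` < sub.`First_B_TS`)
GROUP BY main.`User`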
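
As a quick sanity check of this query shape, the sketch below runs it against the sample rows from the question using Python's built-in sqlite3 module instead of Hive. Several things here are assumptions for illustration only: the integer timestamps 1 to 5 stand in for the elided values, user Z (A events only) is an extra row, the helper name run_check is made up, and the outer alias is renamed from main to ev purely as a precaution because main is also SQLite's default schema name. It checks the logic of the query, not how Hive plans or executes the cross join.

```python
import sqlite3

# Same structure as the adjusted query: cross join, all conditions in WHERE.
QUERY = """
SELECT ev.User, COUNT(ev.Timestamp) AS Before_First_B_Count
FROM Dataset ev
CROSS JOIN (SELECT User, MIN(Timestamp) AS First_B_TS FROM Dataset
    WHERE Type = 'B'
    GROUP BY User) sub
WHERE ev.Type = 'A'
  AND sub.User = ev.User
  AND ev.Timestamp < sub.First_B_TS
GROUP BY ev.User
ORDER BY ev.User
"""

# Sample events as (timestamp, type, user); the timestamps are placeholders.
SAMPLE_ROWS = [
    (1, "A", "X"), (2, "A", "X"), (3, "B", "X"), (4, "A", "X"), (5, "A", "X"),
    (1, "A", "Y"), (2, "A", "Y"), (3, "A", "Y"), (4, "B", "Y"), (5, "A", "Y"),
    (1, "A", "Z"),  # hypothetical user with no B event
]


def run_check():
    """Load the sample rows into an in-memory table and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Dataset (Timestamp INTEGER, Type TEXT, User TEXT)")
    conn.executemany("INSERT INTO Dataset VALUES (?, ?, ?)", SAMPLE_ROWS)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


print(run_check())  # [('X', 2), ('Y', 3)]
```

The counts for X and Y match the wanted result in the question. Z does not appear at all, because a user without a B event has no row in sub to pair with.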

Original:

Try this:

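-- sub holds each user's earliest type B timestamp.
-- Identifiers are quoted with backticks (Hive's quoting); outer columns are
-- qualified with main because both sides of the join expose `User`.
SELECT main.`User`, COUNT(main.`Timestamp`) AS `Before_First_B_Count`
FROM `Dataset` main
JOIN (SELECT `User`, MIN(`Timestamp`) AS `First_B_TS` FROM `Dataset`
    WHERE `Type` = 'B'
    GROUP BY `User`) sub
        ON (sub.`User` = main.`User`) AND (main.`Timestamp` < sub.`First_B_TS`)
WHERE main.`Type` = 'A'
GROUP BY main.`User`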

I did my best to follow Hive syntax. Let me know if you have any questions. I'd be interested to know why you want/need to avoid subqueries.
