sql - Selecting a random attribute from a group in Redshift

Question

I have a dataset in the following form:

id  |   attribute
-----------------
1   |   a
2   |   b
2   |   a
2   |   a
3   |   c

Desired output:

attribute|  num
-------------------
a        |  1
b,a      |  1
c        |  1

In MySQL, I would use:

select attribute, count(*) num
from
   (select id, group_concat(distinct attribute) attribute from dataset group by id) as subquery
group by attribute;

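I am not sure whether this can be done in Redshift, because it does not support group_concat or any of the PostgreSQL group aggregate functions such as array_agg() or string_agg(). See the linked question.

An alternative that would also work for me is to pick a random attribute from each group instead of using group_concat. How can this be done in Redshift?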

score 2 · Accepted Answer

I found a way to pick a random attribute for each id, but it is rather tricky. Honestly, I don't think it is a good approach, but it works.

SQL:

-- (1) uniq dataset: one row per distinct (id, attr) pair
WITH uniq_dataset as (select id, attr from dataset group by id, attr)
SELECT 
  uds.id, rds.attr
FROM
-- (2) generate random rank for each id: an integer in 1..(number of distinct attrs of that id)
--     floor(random() * n) + 1 picks each rank with equal probability
  (select id,
          floor(random() * (select count(*) from uniq_dataset iuds where iuds.id = ouds.id)) + 1 as random_rk
   from (select distinct id from uniq_dataset) ouds) uds,
-- (3) rank table: number the attrs within each id
  (select rank() over(partition by id order by attr) as rk, id, attr from uniq_dataset) rds
WHERE
  uds.id = rds.id
AND 
  uds.random_rk = rds.rk
ORDER BY
  uds.id;

Result:

 id | attr
----+------
  1 | a
  2 | a
  3 | c

OR

 id | attr
----+------
  1 | a
  2 | b
  3 | c

The tables involved in this SQL are as follows:

-- dataset (original table)
 id | attr
----+------
  1 | a
  2 | b
  2 | a
  2 | a
  3 | c

-- (1) uniq dataset
 id | attr
----+------
  1 | a
  2 | a
  2 | b
  3 | c

-- (2) generate random rank for each id
 id | random_rk
----+-----------
  1 |  1
  2 |  1 <- 1 or 2
  3 |  1

-- (3) rank table
 rk | id | attr
----+----+------
  1 |  1 | a
  1 |  2 | a
  2 |  2 | b
  1 |  3 | c
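
The same per-id random pick can probably be written more compactly with a window function. The following is a minimal sketch and not part of the original answer. It assumes the dataset(id, attr) table shown above and that your Redshift cluster accepts ORDER BY RANDOM() inside a window specification. The names shuffled and rn are made up for illustration.

```sql
-- Sketch only (assumption: table dataset(id, attr) as shown above).
WITH uniq_dataset AS (
  -- one row per distinct (id, attr) pair
  SELECT DISTINCT id, attr FROM dataset
),
shuffled AS (
  -- number the attrs of each id in random order
  SELECT id, attr,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY RANDOM()) AS rn
  FROM uniq_dataset
)
-- keep the first attr of each id, then count ids per picked attr
SELECT attr AS attribute, COUNT(*) AS num
FROM shuffled
WHERE rn = 1
GROUP BY attr
ORDER BY attr;
```

Because id 2 may yield either a or b, the counts differ from run to run. Dropping the final GROUP BY and selecting id, attr instead should give the same kind of per-id output as the query above.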

score 0 · Accepted Answer

This is an answer to a related question. That question is closed, so I am posting the answer here.

Here is a way to aggregate a column into a string:

select * from temp;
 attribute 
-----------
 a
 c
 b

1) Give each row a unique rank

with sub_table as(select attribute, rank() over (order by attribute) rnk from temp)
select * from sub_table;

 attribute | rnk 
-----------+-----
 a         |   1
 b         |   2
 c         |   3

2) Use the concatenation operator || to combine the values into a single column

with sub_table as(select attribute, rank() over (order by attribute) rnk from temp)
select (select attribute from sub_table where rnk = 1)||
       (select attribute from sub_table where rnk = 2)||
       (select attribute from sub_table where rnk = 3) res_string;

 res_string 
------------
 abc

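This only works for a fixed, finite number of rows (X) in that column. These can be the first X rows as ordered by the attribute in the "order by" clause. I suspect this is expensive.

A CASE statement can be used to handle the NULL that results when a given rank does not exist.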

with sub_table as(select attribute, rank() over (order by attribute) rnk from temp)
select (select attribute from sub_table where rnk = 1)||
       (select attribute from sub_table where rnk = 2)||
       (select attribute from sub_table where rnk = 3)||
       (case when (select attribute from sub_table where rnk = 4) is NULL then '' 
             else (select attribute from sub_table where rnk = 4) end) as res_string;
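
For comparison, current Redshift documentation lists a LISTAGG aggregate, which may make the rank-by-rank concatenation unnecessary. The sketch below is not from the original answers. It assumes the dataset(id, attribute) table from the question and a cluster recent enough to support LISTAGG with DISTINCT. The alias attributes is made up for illustration.

```sql
-- Sketch only (assumption: LISTAGG with DISTINCT is available on the cluster).
SELECT attributes AS attribute, COUNT(*) AS num
FROM (
  -- build one comma-separated string of distinct attributes per id
  SELECT id,
         LISTAGG(DISTINCT attribute, ',') WITHIN GROUP (ORDER BY attribute) AS attributes
  FROM dataset
  GROUP BY id
) AS subquery
GROUP BY attributes;
```

Because of the ORDER BY inside WITHIN GROUP, id 2 would be reported as a,b rather than b,a as in the desired output of the question.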
