sql - Optimization of SQL needed (maybe DISTINCT ON is the reason?)

Question

Related, preceding question:
Select a random entry from a group after grouping by a value (not a column)?

My current query looks like this:

WITH
  points AS (
    SELECT unnest(array_of_points) AS p
  ),

 gtps AS (
   SELECT DISTINCT ON(points.p)
     points.p, m.groundtruth
   FROM measurement m, points
   WHERE st_distance(m.groundtruth, points.p) < distance
   ORDER BY points.p, RANDOM()
 )

SELECT DISTINCT ON(gtps.p, gtps.groundtruth, m.anchor_id)
  m.id, m.anchor_id, gtps.groundtruth, gtps.p
FROM measurement m, gtps
ORDER BY gtps.p, gtps.groundtruth, m.anchor_id, RANDOM()

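Semantics:

There are two input values:
- Line 4: an array of points, `array_of_points`
- Line 12: a double precision number, `distance`
First paragraph (lines 1-6, the `points` CTE):
- Creates a table from the points array for use in ...
Second paragraph (lines 8-14, the `gtps` CTE):
- For each point in the `points` table: fetch a random (!) `groundtruth` point from the `measurement` table whose distance to it is < `distance`
- Store those tuples in the `gtps` table
Third paragraph (lines 16-19, the final SELECT):
- For each `groundtruth` value in the `gtps` table: fetch all `anchor_id` values and ...
- If an `anchor_id` value is not unique: pick a random one
Output: `id`, `anchor_id`, `groundtruth`, `p` (the input value from `array_of_points`)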
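
The first two paragraphs above can be sketched outside the database. This is a minimal Python illustration of the described semantics, not the query itself: the `MEASUREMENTS` list and the `build_gtps` helper are hypothetical, points are plain `(x, y)` tuples, and Euclidean distance stands in for `st_distance`.

```python
import math
import random

# Hypothetical stand-in for a few rows of the measurement table:
# (id, anchor_id, groundtruth), with groundtruth as an (x, y) tuple.
MEASUREMENTS = [
    (5, 2, (3.0, 2.0)),
    (6, 4, (3.0, 2.0)),
    (12, 1, (6.0, 2.0)),
    (13, 5, (6.0, 2.0)),
]


def build_gtps(array_of_points, distance, rng=random):
    """Map each input point to one randomly chosen groundtruth closer than distance."""
    gtps = {}
    for p in array_of_points:  # first paragraph: one entry per input point
        nearby = [gt for _id, _anchor, gt in MEASUREMENTS if math.dist(gt, p) < distance]
        if nearby:  # a point without any nearby groundtruth produces no row
            gtps[p] = rng.choice(nearby)  # second paragraph: random pick per point
    return gtps


gtps = build_gtps([(2.0, 2.0), (7.0, 3.0), (50.0, 50.0)], 1.5)
```

Because `nearby` holds one entry per matching row, a groundtruth with more rows is more likely to be chosen, which is also how `DISTINCT ON ... ORDER BY points.p, RANDOM()` behaves over the joined rows.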

Example table:

id | anchor_id | groundtruth | data
-----------------------------------
1  | 1         | POINT(1 4)  | ...
2  | 3         | POINT(1 4)  | ...
3  | 8         | POINT(1 4)  | ...
4  | 6         | POINT(1 4)  | ...
-----------------------------------
5  | 2         | POINT(3 2)  | ...
6  | 4         | POINT(3 2)  | ...
-----------------------------------
7  | 1         | POINT(4 3)  | ...
8  | 1         | POINT(4 3)  | ...
9  | 6         | POINT(4 3)  | ...
10 | 7         | POINT(4 3)  | ...
11 | 3         | POINT(4 3)  | ...
-----------------------------------
12 | 1         | POINT(6 2)  | ...
13 | 5         | POINT(6 2)  | ...

Example result:

id  | anchor_id | groundtruth | p
-----------------------------------------
1   | 1         | POINT(1 4)  | POINT(1 0)
2   | 3         | POINT(1 4)  | POINT(1 0)
4   | 6         | POINT(1 4)  | POINT(1 0)
3   | 8         | POINT(1 4)  | POINT(1 0)
5   | 2         | POINT(3 2)  | POINT(2 2)
6   | 4         | POINT(3 2)  | POINT(2 2)
1   | 1         | POINT(1 4)  | POINT(4 8)
2   | 3         | POINT(1 4)  | POINT(4 8)
4   | 6         | POINT(1 4)  | POINT(4 8)
3   | 8         | POINT(1 4)  | POINT(4 8)
12  | 1         | POINT(6 2)  | POINT(7 3)
13  | 5         | POINT(6 2)  | POINT(7 3)
1   | 1         | POINT(4 3)  | POINT(9 1)
11  | 3         | POINT(4 3)  | POINT(9 1)
9   | 6         | POINT(4 3)  | POINT(9 1)
10  | 7         | POINT(4 3)  | POINT(9 1)

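As you can see:

- Each input value can have multiple equal `groundtruth` values.
- If an input value has multiple `groundtruth` values, they must all be equal.
- Each groundtruth-inputPoint tuple is connected with every possible `anchor_id` of that `groundtruth`.
- Two different input values can have the same corresponding `groundtruth` value.
- Two different groundtruth-inputPoint tuples can have the same `anchor_id`.
- Two identical groundtruth-inputPoint tuples must have different `anchor_id`s.

Benchmarks (for two input values):

Lines 1-6: 16 ms
Lines 8-14: 48 ms
Lines 16-19: 600 ms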

EXPLAIN VERBOSE:

Unique  (cost=11119.32..11348.33 rows=18 width=72)
  Output: m.id, m.anchor_id, gtps.groundtruth, gtps.p, (random())
  CTE points
    ->  Result  (cost=0.00..0.01 rows=1 width=0)
          Output: unnest('{0101000000EE7C3F355EF24F4019390B7BDA011940:01010000003480B74082FA44402CD49AE61D173C40}'::geometry[])
  CTE gtps
    ->  Unique  (cost=7659.95..7698.12 rows=1 width=160)
          Output: points.p, m.groundtruth, (random())
          ->  Sort  (cost=7659.95..7679.04 rows=7634 width=160)
                Output: points.p, m.groundtruth, (random())
                Sort Key: points.p, (random())
                ->  Nested Loop  (cost=0.00..6565.63 rows=7634 width=160)
                      Output: points.p, m.groundtruth, random()
                      Join Filter: (st_distance(m.groundtruth, points.p) < m.distance)
                      ->  CTE Scan on points  (cost=0.00..0.02 rows=1 width=32)
                            Output: points.p
                      ->  Seq Scan on public.measurement m  (cost=0.00..535.01 rows=22901 width=132)
                            Output: m.id, m.anchor_id, m.tag_node_id, m.experiment_id, m.run_id, m.anchor_node_id, m.groundtruth, m.distance, m.distance_error, m.distance_truth, m."timestamp"
  ->  Sort  (cost=3421.18..3478.43 rows=22901 width=72)
        Output: m.id, m.anchor_id, gtps.groundtruth, gtps.p, (random())
        Sort Key: gtps.p, gtps.groundtruth, m.anchor_id, (random())
        ->  Nested Loop  (cost=0.00..821.29 rows=22901 width=72)
              Output: m.id, m.anchor_id, gtps.groundtruth, gtps.p, random()
              ->  CTE Scan on gtps  (cost=0.00..0.02 rows=1 width=64)
                    Output: gtps.p, gtps.groundtruth
              ->  Seq Scan on public.measurement m  (cost=0.00..535.01 rows=22901 width=8)
                    Output: m.id, m.anchor_id, m.tag_node_id, m.experiment_id, m.run_id, m.anchor_node_id, m.groundtruth, m.distance, m.distance_error, m.distance_truth, m."timestamp"

EXPLAIN ANALYZE:

Unique  (cost=11119.32..11348.33 rows=18 width=72) (actual time=548.991..657.992 rows=36 loops=1)
  CTE points
    ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.004..0.011 rows=2 loops=1)
  CTE gtps
    ->  Unique  (cost=7659.95..7698.12 rows=1 width=160) (actual time=133.416..146.745 rows=2 loops=1)
          ->  Sort  (cost=7659.95..7679.04 rows=7634 width=160) (actual time=133.415..142.255 rows=15683 loops=1)
                Sort Key: points.p, (random())
                Sort Method: external merge  Disk: 1248kB
                ->  Nested Loop  (cost=0.00..6565.63 rows=7634 width=160) (actual time=0.045..46.670 rows=15683 loops=1)
                      Join Filter: (st_distance(m.groundtruth, points.p) < m.distance)
                      ->  CTE Scan on points  (cost=0.00..0.02 rows=1 width=32) (actual time=0.007..0.020 rows=2 loops=1)
                      ->  Seq Scan on measurement m  (cost=0.00..535.01 rows=22901 width=132) (actual time=0.013..3.902 rows=22901 loops=2)
  ->  Sort  (cost=3421.18..3478.43 rows=22901 width=72) (actual time=548.989..631.323 rows=45802 loops=1)
        Sort Key: gtps.p, gtps.groundtruth, m.anchor_id, (random())
        Sort Method: external merge  Disk: 4008kB
        ->  Nested Loop  (cost=0.00..821.29 rows=22901 width=72) (actual time=133.449..166.294 rows=45802 loops=1)
              ->  CTE Scan on gtps  (cost=0.00..0.02 rows=1 width=64) (actual time=133.420..146.753 rows=2 loops=1)
              ->  Seq Scan on measurement m  (cost=0.00..535.01 rows=22901 width=8) (actual time=0.014..4.409 rows=22901 loops=2)
Total runtime: 834.626 ms

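In production this has to run for about 100 to 1000 input values. At the current speed that would take 35 to 350 seconds, which is far too long.

I already tried removing the RANDOM() function. That reduces the execution time (for 2 input values) from about 670 ms to about 530 ms, so it is not the main factor at the moment.

If it is easier or faster, it is also possible to run 2 or 3 separate queries and do some parts in software (it runs on a Ruby on Rails server). For example the random selection?!

Work in progress: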

SELECT
  m.groundtruth, ps.p, ARRAY_AGG(m.anchor_id), ARRAY_AGG(m.id)
FROM
  measurement m
JOIN
  (SELECT unnest(point_array) AS p) AS ps
  ON ST_DWithin(ps.p, m.groundtruth, distance)
GROUP BY groundtruth, ps.p

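This query is extremely fast (15 ms), but a lot is still missing:

- I only need one random row for each `ps.p`
- The two arrays belong together. That means: the order of the items inside them matters!
- Those two arrays need to be filtered (randomly):
  for each `anchor_id` that appears more than once in the array: keep a random one and delete all others. This also means removing the corresponding `id` from the `id` array for every removed `anchor_id`

It would also be nice if `anchor_id` and `id` could be stored inside an array of tuples. For example: {[4,1],[6,3],[4,2],[8,5],[4,4]} (constraints: every tuple is unique, every id (== the second value in this example) is unique, anchor_ids are not unique). This example shows the query result without the filters that still have to be applied. With the filters applied, it would look like this: {[6,3],[4,4],[8,5]}.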
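
The filter described above can be illustrated on the example array. This is a hedged Python sketch of the intended behavior only, with a hypothetical helper name `keep_one_id_per_anchor`; it is not part of the query.

```python
import random


def keep_one_id_per_anchor(pairs, rng=random):
    """Keep one randomly chosen [anchor_id, id] pair per anchor_id."""
    ids_by_anchor = {}
    for anchor_id, id_ in pairs:
        ids_by_anchor.setdefault(anchor_id, []).append(id_)
    # Each surviving pair is rebuilt from its own group,
    # so an anchor_id never ends up next to a foreign id.
    return [[anchor_id, rng.choice(ids)] for anchor_id, ids in ids_by_anchor.items()]


pairs = [[4, 1], [6, 3], [4, 2], [8, 5], [4, 4]]
filtered = keep_one_id_per_anchor(pairs)
```

For `anchor_id` 4 the surviving id is 1, 2 or 4, chosen at random; `[6, 3]` and `[8, 5]` always survive. The result from the example, {[6,3],[4,4],[8,5]}, is one possible outcome, apart from element order.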

Work in progress II:

SELECT DISTINCT ON (ps.p)
  m.groundtruth, ps.p, ARRAY_AGG(m.anchor_id), ARRAY_AGG(m.id)
FROM
  measurement m
JOIN
  (SELECT unnest(point_array) AS p) AS ps
  ON ST_DWithin(ps.p, m.groundtruth, distance)
GROUP BY ps.p, m.groundtruth
ORDER BY ps.p, RANDOM()

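This gives very good results and is still very fast: 16 ms.
There is just one thing left to do:

- `ARRAY_AGG(m.anchor_id)` is already randomized, but:
- it contains a lot of duplicate entries, so:
- I would like to use something like DISTINCT on it, but:
- it has to stay synchronized with `ARRAY_AGG(m.id)`. That means: if the DISTINCT command keeps indices 1, 4 and 7 of the `anchor_id` array, it also has to keep indices 1, 4 and 7 of the `id` array (and of course delete all others).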

score 2 · Accepted Answer

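"It would also be nice if anchor_id and id could be stored inside an array of tuples."

Aggregate function for multi-dimensional arrays

I assume you would create a two-dimensional array for that. That is easier to handle than an ARRAY of record. At the time of writing, the standard array_agg() could not aggregate multi-dimensional arrays, but you can write your own aggregate function for that rather easily. (On PostgreSQL 9.5 or later the built-in array_agg() also accepts array input, so array_agg(ARRAY[anchor_id, id]) should produce the same two-dimensional result without a custom aggregate.)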

CREATE AGGREGATE array_agg_mult (anyarray)  (
    SFUNC     = array_cat
   ,STYPE     = anyarray
   ,INITCOND  = '{}'
);

Read the explanation in this related answer:
Selecting data into a Postgres array
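
To make the aggregate definition more concrete, here is a rough Python analogy of what it computes. The state starts as the empty array (INITCOND = '{}') and every input is appended with the state function. The Python functions below only mimic the idea on nested lists; they are illustrative stand-ins, not PostgreSQL's implementation.

```python
from functools import reduce


def array_cat(state, value):
    """Stand-in for array concatenation: append the rows of value to state."""
    return state + value


def array_agg_mult(inputs):
    """Fold array_cat over all inputs, starting from the empty array."""
    return reduce(array_cat, inputs, [])


rows = [(4, 1), (6, 3), (8, 5)]  # hypothetical (anchor_id, id) rows
# Each row is wrapped as a 1 x 2 array, like ARRAY[ARRAY[anchor_id, id]].
ids = array_agg_mult([[[anchor_id, id_]] for anchor_id, id_ in rows])
```

Wrapping each row as a one-row, two-column array is what lets the concatenation grow along the first dimension and produce an N x 2 result.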

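"For each anchor_id that appears more than once in the array: keep a random one and delete all others. This also means removing the corresponding id from the id array for every removed anchor_id."

Query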

SELECT DISTINCT ON (p)
       p, groundtruth, array_agg_mult(ARRAY[ARRAY[anchor_id, id]]) AS ids
FROM (
   SELECT DISTINCT ON (ps.p, m.groundtruth, m.anchor_id)
          ps.p, m.groundtruth, m.anchor_id, m.id
   FROM  (SELECT unnest(point_array) AS p) AS ps
   JOIN   measurement m ON ST_DWithin(ps.p, m.groundtruth, distance)
   ORDER  BY ps.p, m.groundtruth, m.anchor_id, random()
   ) x
GROUP  BY p, groundtruth
ORDER  BY p, random();

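Subquery x gets distinct `anchor_id` per `(p, groundtruth)` and picks a random row if there are multiple peers. This way the connection `anchor_id - id` stays intact.
The outer query aggregates a two-dimensional array like you wished for, ordered by `anchor_id`. If you want `anchor_id` ordered randomly, use random once more: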
```
array_agg_mult(ARRAY[ARRAY[anchor_id, id]] ORDER BY random())
```
And finally, the DISTINCT ON picks only one `groundtruth` per `p`, randomly again.
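
The last two stages (GROUP BY, then DISTINCT ON) can be sketched in Python to show how they interact. This is an illustration under assumptions: the helper name `aggregate_and_pick` and the sample rows are hypothetical, and the input is taken to be the output of subquery x, so it is already distinct per `anchor_id`.

```python
import random
from collections import defaultdict


def aggregate_and_pick(rows, rng=random):
    """rows: (p, groundtruth, anchor_id, id) tuples, already distinct per anchor_id."""
    groups = defaultdict(list)
    for p, groundtruth, anchor_id, id_ in rows:  # GROUP BY p, groundtruth
        groups[(p, groundtruth)].append([anchor_id, id_])

    groundtruths_by_p = defaultdict(list)
    for p, groundtruth in groups:
        groundtruths_by_p[p].append(groundtruth)

    result = []
    for p, candidates in groundtruths_by_p.items():  # DISTINCT ON (p) ... ORDER BY p, random()
        groundtruth = rng.choice(candidates)  # one random group per p
        result.append((p, groundtruth, groups[(p, groundtruth)]))
    return result


rows = [
    ("p1", "g1", 1, 10),
    ("p1", "g1", 3, 11),
    ("p1", "g2", 2, 12),
    ("p2", "g1", 1, 10),
]
picked = aggregate_and_pick(rows)
```

One consequence worth checking against your requirements: the random pick happens after grouping, so every `groundtruth` near a `p` appears to be equally likely here, regardless of how many measurement rows it has.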
