postgresql - PostgreSQL がサブクエリを結合に折りたたむのはいつですか?

Question

次のクエリを検討してください。

select a.id from a
where
    a.id in (select b.a_id from b where b.x='x1' and b.y='y1') and
    a.id in (select b.a_id from b where b.x='x2' and b.y='y2')
order by a.date desc
limit 20

どちらがより高速なものに書き換え可能である必要があります:

select a.id from a inner join b as b1 on (a.id=b1.a_id) inner join b as b2 on (a.id=b2.a_id)
where
    b1.x='x1' and b1.y='y1' and
    b2.x='x2' and b2.y='y2'
order by a.date desc
limit 20

ソースコードを変更してクエリを書き直すのは非常に複雑になるため (特に Django を使用している場合)、できるだけ避けたいと考えています。

したがって、PostgreSQL がサブクエリを結合に折りたたむのはいつで、そうでないのはいつなのだろうか?

これが単純化されたデータモデルです。

                                      Table "public.a"
      Column       |          Type          |                          Modifiers
-------------------+------------------------+-------------------------------------------------------------
 id                | integer                | not null default nextval('a_id_seq'::regclass)
 date              | date                   | 
 content           | character varying(256) | 
Indexes:
    "a_pkey" PRIMARY KEY, btree (id)
    "a_id_date" btree (id, date)
Referenced by:
    TABLE "b" CONSTRAINT "a_id_refs_id_6e634433343d4435353" FOREIGN KEY (a_id) REFERENCES a(id) DEFERRABLE INITIALLY DEFERRED


       Table "public.b"
  Column  |   Type    | Modifiers 
----------+-----------+-----------
 a_id     | integer   | not null
 x        | text      | not null
 y        | text      | not null
Indexes:
    "b_x_y_a_id" UNIQUE CONSTRAINT, btree (x, y, a_id)
Foreign-key constraints:
    "a_id_refs_id_6e634433343d4435353" FOREIGN KEY (a_id) REFERENCES a(id) DEFERRABLE INITIALLY DEFERRED

a には 700 万行あります
b には 7000 万行あります
bx のカーディナリティ = ~100
by = ~100000 のカーディナリティ
bx のカーディナリティ by = ~150000
テーブル c、d、および e が b と同じ構造を持ち、追加で使用して結果の a.ids をさらに減らすことができると想像してください。

PostgreSQL のバージョン、クエリをテストしました。

PostgreSQL 9.2.7 on x86_64-suse-linux-gnu, compiled by gcc (SUSE Linux) 4.7.2 20130108 [gcc-4_7-branch revision 195012], 64-bit
PostgreSQL 9.4beta1 on x86_64-suse-linux-gnu, compiled by gcc (SUSE Linux) 4.7.2 20130108 [gcc-4_7-branch revision 195012], 64-bit

クエリプラン (空のファイルキャッシュとメモリキャッシュを使用):

score 0 · Accepted Answer

        -- The table definitions
CREATE TABLE table_a (
        id     SERIAL NOT NULL PRIMARY KEY
        , d      DATE NOT NULL
        );

CREATE TABLE table_b (
        id     SERIAL NOT NULL PRIMARY KEY
        , a_id INTEGER NOT NULL REFERENCES table_a(id)
        , x VARCHAR NOT NULL
        , y VARCHAR NOT NULL
        );
        -- fake some data
INSERT INTO table_a(d)
SELECT gs
FROM generate_series( '1904-01-01'::timestamp ,'2015-01-01'::timestamp, '1 day'::interval) gs;
INSERT INTO table_b(a_id, x, y) SELECT a.id, 'x1' , 'y1' FROM table_a a;
INSERT INTO table_b(a_id, x, y) SELECT a.id, 'x2' , 'y2' FROM table_a a;
INSERT INTO table_b(a_id, x, y) SELECT a.id, 'x3' , 'y3' FROM table_a a;
DELETE FROM table_b WHERE RANDOM() > 0.3;

CREATE UNIQUE INDEX ON table_a(d, id);  -- date first
CREATE INDEX ON table_b(a_id);          -- supporting the FK

        -- For initialising the statistics
VACUUM ANALYZE table_a;
VACUUM ANALYZE table_b;

        -- original query
EXPLAIN ANALYZE
SELECT a.id
FROM table_a a
WHERE a.id IN (SELECT b.a_id FROM table_b b WHERE b.x='x1' AND b.y='y1')
  AND a.id IN (SELECT b.a_id FROM table_b b WHERE b.x='x2' AND b.y='y2')
order by a.d desc
limit 20;

        -- EXISTS() version
EXPLAIN ANALYZE
SELECT a.id
FROM table_a a
WHERE EXISTS (SELECT * FROM table_b b WHERE b.a_id= a.id AND b.x='x1' AND b.y='y1')
  AND EXISTS (SELECT * FROM table_b b WHERE b.a_id= a.id AND b.x='x2' AND b.y='y2')
order by a.d desc
limit 20;

結果のクエリプラン:

 Limit  (cost=0.87..491.23 rows=20 width=8) (actual time=0.080..0.521 rows=20 loops=1)
   ->  Nested Loop Semi Join  (cost=0.87..15741.40 rows=642 width=8) (actual time=0.080..0.518 rows=20 loops=1)
         ->  Nested Loop Semi Join  (cost=0.58..14380.54 rows=4043 width=12) (actual time=0.017..0.391 rows=74 loops=1)
               ->  Index Only Scan Backward using table_a_d_id_idx on table_a a  (cost=0.29..732.75 rows=40544 width=8) (actual time=0.008..0.048 rows=231 loops=1)
                     Heap Fetches: 0
               ->  Index Scan using table_b_a_id_idx on table_b b_1  (cost=0.29..0.34 rows=1 width=4) (actual time=0.001..0.001 rows=0 loops=231)
                     Index Cond: (a_id = a.id)
                     Filter: (((x)::text = 'x2'::text) AND ((y)::text = 'y2'::text))
                     Rows Removed by Filter: 0
         ->  Index Scan using table_b_a_id_idx on table_b b  (cost=0.29..0.34 rows=1 width=4) (actual time=0.001..0.001 rows=0 loops=74)
               Index Cond: (a_id = a.id)
               Filter: (((x)::text = 'x1'::text) AND ((y)::text = 'y1'::text))
               Rows Removed by Filter: 1
 Total runtime: 0.547 ms

2 つのクエリは、まったく同じクエリプランと結果を引き起こします ( の NOT NULL のためtableb.a_id) 。
ハッシュ結合よりもインデックス結合を好む場合、インデックスtable_b(a_id)は絶対に必要です (7M//70M タプルでは、インデックススキャンを優先する必要があると思います)。
外側のクエリでの並べ替え (コストがかかる) が回避されました ( index を使用table_a(d, id))

postgresql - PostgreSQL がサブクエリを結合に折りたたむのはいつですか?

2 に答える 2

Related

Reference