database - PostgreSQL query optimization: no inner/outer joins allowed

Question

I was given this query to optimize on PostgreSQL 9.2:

SELECT C.name, COUNT(DISTINCT I.id) AS NumItems, COUNT(B.id)
FROM Categories C INNER JOIN Items I ON(C.id = I.category) 
                  INNER JOIN Bids B ON (I.id = B.item_id)
GROUP BY C.name

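This is part of a school assignment.

I have created these indexes on the respective tables: items(category) -> secondary B+ tree, bids(item_id) -> secondary B+ tree, and categories(id) -> primary index.

The weird part is that PostgreSQL is doing sequential scans on my Items, Categories, and Bids tables. When I set enable_seqscan=off, the index search turns out to be even more horrendous than the result below.

When I run EXPLAIN in PostgreSQL, this is the result. Please do not remove the indentation, as it is important.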

GroupAggregate  (cost=119575.55..125576.11 rows=20 width=23) (actual time=6912.523..9459.431 rows=20 loops=1)
  Buffers: shared hit=30 read=12306, temp read=6600 written=6598
  ->  Sort  (cost=119575.55..121075.64 rows=600036 width=23) (actual time=6817.015..8031.285 rows=600036 loops=1)
        Sort Key: c.name
        Sort Method: external merge  Disk: 20160kB
        Buffers: shared hit=30 read=12306, temp read=6274 written=6272
        ->  Hash Join  (cost=9416.95..37376.03 rows=600036 width=23) (actual time=407.974..3322.253 rows=600036 loops=1)
              Hash Cond: (b.item_id = i.id)
              Buffers: shared hit=30 read=12306, temp read=994 written=992
              ->  Seq Scan on bids b  (cost=0.00..11001.36 rows=600036 width=8) (actual time=0.009..870.898 rows=600036 loops=1)
                    Buffers: shared hit=2 read=4999
              ->  Hash  (cost=8522.95..8522.95 rows=50000 width=19) (actual time=407.784..407.784 rows=50000 loops=1)
                    Buckets: 4096  Batches: 2  Memory Usage: 989kB
                    Buffers: shared hit=28 read=7307, temp written=111
                    ->  Hash Join  (cost=1.45..8522.95 rows=50000 width=19) (actual time=0.082..313.211 rows=50000 loops=1)
                          Hash Cond: (i.category = c.id)
                          Buffers: shared hit=28 read=7307
                          ->  Seq Scan on items i  (cost=0.00..7834.00 rows=50000 width=8) (actual time=0.004..144.554 rows=50000 loops=1)
                                Buffers: shared hit=27 read=7307
                          ->  Hash  (cost=1.20..1.20 rows=20 width=19) (actual time=0.062..0.062 rows=20 loops=1)
                                Buckets: 1024  Batches: 1  Memory Usage: 1kB
                                Buffers: shared hit=1
                                ->  Seq Scan on categories c  (cost=0.00..1.20 rows=20 width=19) (actual time=0.004..0.028 rows=20 loops=1)
                                      Buffers: shared hit=1
Total runtime: 9473.257 ms

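See this plan on explain.depesz.com.

I just want to know why this happens, i.e. why the indexes make the query so much slower than a sequential scan.

Edit: I think I have managed to work out a couple of things by going through the PostgreSQL documentation. PostgreSQL decided to do a seq scan on some tables, such as Bids and Items, because it predicted that it would have to retrieve every single row in the table (compare the row count in the parentheses before "actual time" with the row count in the "actual time" part). A sequential scan is better at retrieving all rows, so nothing can be done about that part.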
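The comparison described above can be reproduced without changing the setting for the whole session. This is a minimal sketch, assuming the table and column names from the query in the question; SET LOCAL reverts at the end of the transaction. Note that enable_seqscan = off does not forbid sequential scans, it only makes the planner treat them as very expensive.

```sql
-- Plan with the planner's default choices
EXPLAIN (ANALYZE, BUFFERS)
SELECT C.name, COUNT(DISTINCT I.id) AS NumItems, COUNT(B.id)
FROM Categories C INNER JOIN Items I ON (C.id = I.category)
                  INNER JOIN Bids B ON (I.id = B.item_id)
GROUP BY C.name;

-- Same query with sequential scans discouraged, for this transaction only
BEGIN;
SET LOCAL enable_seqscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT C.name, COUNT(DISTINCT I.id) AS NumItems, COUNT(B.id)
FROM Categories C INNER JOIN Items I ON (C.id = I.category)
                  INNER JOIN Bids B ON (I.id = B.item_id)
GROUP BY C.name;
ROLLBACK;
```

Comparing the "rows=" estimate with the actual "rows=" on each scan node shows whether the planner expected to read the whole table.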

I created an additional index on categories(name). The result below is what I have. It improved somewhat, but now the hash join has been replaced by a nested loop. Any clue why?

GroupAggregate  (cost=0.00..119552.02 rows=20 width=23) (actual time=617.330..7725.314 rows=20 loops=1)
  Buffers: shared hit=178582 read=37473 written=14, temp read=2435 written=436
  ->  Nested Loop  (cost=0.00..115051.55 rows=600036 width=23) (actual time=0.120..6186.496 rows=600036 loops=1)
        Buffers: shared hit=178582 read=37473 written=14, temp read=2109 written=110
        ->  Nested Loop  (cost=0.00..26891.55 rows=50000 width=19) (actual time=0.066..2827.955 rows=50000 loops=1)
              Join Filter: (c.id = i.category)
              Rows Removed by Join Filter: 950000
              Buffers: shared hit=2 read=7334 written=1, temp read=2109 written=110
              ->  Index Scan using categories_name_idx on categories c  (cost=0.00..12.55 rows=20 width=19) (actual time=0.039..0.146 rows=20 loops=1)
                    Buffers: shared hit=1 read=1
              ->  Materialize  (cost=0.00..8280.00 rows=50000 width=8) (actual time=0.014..76.908 rows=50000 loops=20)
                    Buffers: shared hit=1 read=7333 written=1, temp read=2109 written=110
                    ->  Seq Scan on items i  (cost=0.00..7834.00 rows=50000 width=8) (actual time=0.007..170.464 rows=50000 loops=1)
                          Buffers: shared hit=1 read=7333 written=1
        ->  Index Scan using bid_itemid_idx on bids b  (cost=0.00..1.60 rows=16 width=8) (actual time=0.016..0.036 rows=12 loops=50000)
              Index Cond: (item_id = i.id)
              Buffers: shared hit=178580 read=30139 written=13
Total runtime: 7726.392 ms

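Have a look at the plan here, if that is easier.

I managed to reduce the cost to 114062.92 by creating indexes on categories(id) and items(category). PostgreSQL used both indexes to reach a cost of 114062.92. Now, however, PostgreSQL is playing games with me by not using the indexes at all! Why is it so buggy?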

score 1 · Accepted Answer

Thank you for posting EXPLAIN output without being asked, and for using EXPLAIN (BUFFERS, ANALYZE).

A significant part of your query's performance issue is likely to be the outer sort plan node, which is doing an on-disk merge sort with a temporary file:

Sort Method: external merge Disk: 20160kB

You could do this sort in memory by setting:

SET work_mem = '50MB';

before running your query. This setting can also be set per-user, per-database or globally in postgresql.conf.
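As a sketch of the narrower scopes mentioned above: the role name app_user and the database name auction are hypothetical placeholders, and 50MB is simply the value suggested here, not a tuned figure. An in-memory sort typically needs more memory than the on-disk size reported in the plan, so whether 50MB is enough should be checked in the new EXPLAIN output.

```sql
-- Per-user and per-database defaults (take effect for new sessions)
ALTER ROLE app_user SET work_mem = '50MB';      -- hypothetical role name
ALTER DATABASE auction SET work_mem = '50MB';   -- hypothetical database name

-- Limit the change to a single transaction and verify the effect
BEGIN;
SET LOCAL work_mem = '50MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT C.name, COUNT(DISTINCT I.id) AS NumItems, COUNT(B.id)
FROM Categories C INNER JOIN Items I ON (C.id = I.category)
                  INNER JOIN Bids B ON (I.id = B.item_id)
GROUP BY C.name;
COMMIT;
```

If the sort now fits in memory, the Sort node reports "Sort Method: quicksort  Memory: ...kB" instead of "external merge  Disk: ...kB".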

I'm not convinced that adding indexes will be of much benefit as the query is currently structured. It needs to read and join all rows from all three tables, and hash joins are likely to be the fastest way to do so.

I suspect there are other ways to express that query that will use entirely different and more efficient execution strategies, but I'm having a brain-fade about what they might be and don't want to spend the time to make up dummy tables to play around. More work_mem should significantly improve the query as it stands.

score 0 · Accepted Answer

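Note that if bids is systematically and significantly larger than items, it may actually be cheaper to traverse items twice (especially if items fits in RAM) than to pick those distinct item IDs out of the join result (even if you sort in memory). Moreover, depending on how the Postgres pipeline happens to get data from the duplicated items table, there may be only a limited penalty even under adverse load or memory conditions (this would be a good question to ask on pgsql-general).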

SELECT name, IC.cnt, BC.cnt FROM
Categories C,
( SELECT category, count(1) cnt from Items I GROUP BY category ) IC,
( SELECT category, count(1) cnt from Bids B INNER JOIN Items I ON (I.id = B.item_id) GROUP BY category ) BC
WHERE IC.category=C.id AND BC.category=C.id;

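How much cheaper? At least 4x given sufficient caching, i.e. 610 ms vs. 2500 ms (with an in-memory sort) for 20 categories, 50k items, and 600k bids, and still more than 2x faster after flushing the filesystem cache on my machine.

Note that the above is not a direct substitute for your original query. For one, it assumes a 1:1 mapping between category IDs and names (which may well turn out to be a very reasonable assumption; if not, simply use SUM(BC.cnt) and SUM(IC.cnt) while you GROUP BY name). More importantly, the per-category item count includes items that have no bids, unlike your original INNER JOIN. If you only need the count of items that have bids, you can add WHERE EXISTS (SELECT 1 FROM Bids B where item_id=I.id) to the IC subquery. This will traverse Bids a second time as well (in my case it added a ~200 ms penalty to the existing ~600 ms plan, which is still well below 2400 ms).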
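Putting both adjustments together (the EXISTS filter and grouping by name) might look like the following. This is only a sketch: it assumes Items.id and Categories.id are unique keys and Bids.id is never NULL, and the aliases NumItems and NumBids are illustrative.

```sql
SELECT C.name,
       SUM(IC.cnt) AS NumItems,   -- items that have at least one bid
       SUM(BC.cnt) AS NumBids     -- all bids on items in the category
FROM Categories C
JOIN ( SELECT I.category, count(*) AS cnt
       FROM Items I
       WHERE EXISTS (SELECT 1 FROM Bids B WHERE B.item_id = I.id)
       GROUP BY I.category ) IC ON IC.category = C.id
JOIN ( SELECT I.category, count(*) AS cnt
       FROM Bids B
       JOIN Items I ON I.id = B.item_id
       GROUP BY I.category ) BC ON BC.category = C.id
GROUP BY C.name;
```

Under those assumptions this should return the same groups and counts as the original three-way join, although SUM over a bigint count yields numeric in PostgreSQL, so the column types differ.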

score 0 · Accepted Answer

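From the query plans we can see that:
1. The result and Categories each have 20 records.
2. Items with a category are 5% of all items: "Rows Removed by Join Filter: 950000" versus "rows=50000" in the sequential scan.
3. Matched bids are rows=600036 (could you give us the total number of bids?).
4. Does every category have bids?

So we want to use the indexes on items(category) and bids(item_id). We also want the sort to fit in memory.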

 select  
   (select name from Categories where id = foo.category) as name, 
   count(foo.id),  
   sum(foo.bids_count)  
 from 
   (select 
      id,  
      category,  
      (select count(item_id) from Bids where item_id = i.id) as bids_count  
    from Items i  
    where category in (select id from Categories)  
      and exists (select 1 from Bids where item_id = i.id)  
   ) as foo  
  group by foo.category  
  order by name

Of course, you have to remember that this depends strictly on the data in points 1 and 2.

If 4 is true, you can remove the EXISTS part from the query.
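The data-dependent points above can be checked directly. A rough sketch, assuming the table and column names used in the question and that Categories.id is unique:

```sql
-- Point 1: number of categories
SELECT count(*) AS categories FROM Categories;

-- Point 2: how many items have a matching category
SELECT count(*)    AS total_items,
       count(C.id) AS items_with_category
FROM Items I
LEFT JOIN Categories C ON C.id = I.category;

-- Point 3: total number of bids
SELECT count(*) AS total_bids FROM Bids;

-- Point 4: categories that have no bids at all
SELECT C.id, C.name
FROM Categories C
WHERE NOT EXISTS (SELECT 1
                  FROM Items I
                  JOIN Bids B ON B.item_id = I.id
                  WHERE I.category = C.id);

-- Items without any bid: the EXISTS filter only changes the item count if this is non-zero
SELECT count(*) AS items_without_bids
FROM Items I
WHERE NOT EXISTS (SELECT 1 FROM Bids B WHERE B.item_id = I.id);
```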

Any advice or ideas?
