sql - SQL 1-to-many join: querying for distinct parent rows only

Question

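I have two tables, invoices and invoiceitems. The relationship is 1-to-many. My application allows querying invoices by invoice item fields. Only the invoices are returned, not the items.

For example, I would like to get all invoices that have items whose name contains ac, case-insensitively. The output is paginated, so I run one query to get the number of invoices satisfying the condition and then another query to fetch each page of invoices.

The table sizes are:

invoices - 65,000 records
invoiceitems - 3,281,518 records
terms - 5 items
reps - 5 items
shipVia - 5 items

Each invoice is linked to at most 100 invoice items.

My problem is that I cannot figure out the optimal indexes for the query.

Schema: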

CREATE TABLE invoiceitems
(
  id serial NOT NULL,
  invoice_id integer NOT NULL,
  name text NOT NULL,
  ...
  CONSTRAINT invoiceitems_pkey PRIMARY KEY (id),
  CONSTRAINT invoiceitems_invoice_id_fkey FOREIGN KEY (invoice_id)
      REFERENCES invoices (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION
);

CREATE INDEX idx_lower_name
  ON invoiceitems
  USING btree
  (lower(name) COLLATE pg_catalog."default" text_pattern_ops);

CREATE TABLE invoices
(
  id serial NOT NULL,
  term_id integer,
  rep_id integer NOT NULL,
  ship_via_id integer,
  ...
  CONSTRAINT invoices_pkey PRIMARY KEY (id),
  CONSTRAINT invoices_rep_id_fkey FOREIGN KEY (rep_id)
      REFERENCES reps (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT invoices_ship_via_id_fkey FOREIGN KEY (ship_via_id)
      REFERENCES shipvia (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT invoices_term_id_fkey FOREIGN KEY (term_id)
      REFERENCES terms (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION
);

The count query:

SELECT COUNT(DISTINCT(o.id))
FROM invoices o
JOIN invoiceitems items ON items.invoice_id = o.id
LEFT JOIN terms t ON t.id = o.term_id
LEFT JOIN reps r ON r.id = o.rep_id
LEFT JOIN shipVia s ON s.id = o.ship_via_id
WHERE LOWER(items.name) LIKE '%ac%';

The result:

6518

The query plan:

"Aggregate  (cost=107651.35..107651.36 rows=1 width=4)"
"  ->  Hash Join  (cost=3989.50..106010.59 rows=656304 width=4)"
"        Hash Cond: (items.invoice_id = o.id)"
"        ->  Seq Scan on invoiceitems items  (cost=0.00..85089.77 rows=656304 width=4)"
"              Filter: (lower(name) ~~ '%ac%'::text)"
"        ->  Hash  (cost=2859.00..2859.00 rows=65000 width=16)"
"              ->  Seq Scan on invoices o  (cost=0.00..2859.00 rows=65000 width=16)"

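The functional index on the invoiceitems.name field does not seem to be used at all. I think that is because I am looking for a part of the name that is not a strict prefix of it. I am not sure, but the primary key index on invoices does not seem to be used here either.

My question is: can the count query or the schema be optimized to improve performance?

I have to allow searching by a part of the name that is not a strict prefix, and I must also support case-insensitive search.

The query that returns the matching records is just as bad: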

SELECT DISTINCT(o.id), t.terms, r.rep, s.ship_via, ...
FROM invoices o
JOIN invoiceitems items ON items.invoice_id = o.id
LEFT JOIN terms t ON t.id = o.term_id
LEFT JOIN reps r ON r.id = o.rep_id
LEFT JOIN shipVia s ON s.id = o.ship_via_id
WHERE LOWER(items.name) LIKE '%ac%'
LIMIT 100;

And its plan:

"Limit  (cost=901846.63..901854.13 rows=100 width=627)"
"  ->  Unique  (cost=901846.63..951069.43 rows=656304 width=627)"
"        ->  Sort  (cost=901846.63..903487.39 rows=656304 width=627)"
"              Sort Key: o.id, t.terms, r.rep, s.ship_via, ..."
"              ->  Hash Join  (cost=11509.54..286596.53 rows=656304 width=627)"
"                    Hash Cond: (items.invoice_id = o.id)"
"                    ->  Seq Scan on invoiceitems items  (cost=0.00..85089.77 rows=656304 width=4)"
"                          Filter: (lower(name) ~~ '%ac%'::text)"
"                    ->  Hash  (cost=5491.03..5491.03 rows=65000 width=627)"
"                          ->  Hash Left Join  (cost=113.02..5491.03 rows=65000 width=627)"
"                                Hash Cond: (o.ship_via_id = s.id)"
"                                ->  Hash Left Join  (cost=75.35..4559.61 rows=65000 width=599)"
"                                      Hash Cond: (o.rep_id = r.id)"
"                                      ->  Hash Left Join  (cost=37.67..3628.19 rows=65000 width=571)"
"                                            Hash Cond: (o.term_id = t.id)"
"                                            ->  Seq Scan on invoices o  (cost=0.00..2859.00 rows=65000 width=543)"
"                                            ->  Hash  (cost=22.30..22.30 rows=1230 width=36)"
"                                                  ->  Seq Scan on terms t  (cost=0.00..22.30 rows=1230 width=36)"
"                                      ->  Hash  (cost=22.30..22.30 rows=1230 width=36)"
"                                            ->  Seq Scan on reps r  (cost=0.00..22.30 rows=1230 width=36)"
"                                ->  Hash  (cost=22.30..22.30 rows=1230 width=36)"
"                                      ->  Seq Scan on shipvia s  (cost=0.00..22.30 rows=1230 width=36)"

I am limited to PostgreSQL. Switching to SQL Server is not an option.

EDIT ================================================================

I have followed Erwin's very helpful instructions, and here is what I have.

The index:

CREATE INDEX invoiceitems_name_gin_trgm_idx ON invoiceitems USING gin (name gin_trgm_ops);

The count query using JOIN, but without the extra tables:

EXPLAIN ANALYZE SELECT COUNT(DISTINCT(o.id)) 
FROM invoices o 
JOIN invoiceitems items ON items.invoice_id = o.id 
WHERE items.name ILIKE '%ac%';

"Aggregate  (cost=78961.52..78961.53 rows=1 width=4) (actual time=5205.448..5205.450 rows=1 loops=1)"
"  ->  Nested Loop  (cost=0.00..78960.73 rows=316 width=4) (actual time=0.396..5176.761 rows=6518 loops=1)"
"        ->  Seq Scan on invoiceitems items  (cost=0.00..76885.98 rows=316 width=4) (actual time=0.021..4502.043 rows=6518 loops=1)"
"              Filter: (name ~~* '%ac%'::text)"
"              Rows Removed by Filter: 3275000"
"        ->  Index Only Scan using invoices_pkey on invoices o  (cost=0.00..6.56 rows=1 width=4) (actual time=0.012..0.015 rows=1 loops=6518)"
"              Index Cond: (id = items.invoice_id)"
"              Heap Fetches: 6518"
"Total runtime: 5205.509 ms"

The count query using a semi-join:

EXPLAIN ANALYZE SELECT COUNT(1)
FROM   invoices o
WHERE EXISTS (
   SELECT 1
   FROM   invoiceitems i 
   WHERE  i.invoice_id = o.id
   AND    i.name ILIKE '%ac%'
   );

"Aggregate  (cost=76920.43..76920.44 rows=1 width=0) (actual time=5713.597..5713.598 rows=1 loops=1)"
"  ->  Nested Loop  (cost=76886.76..76919.64 rows=316 width=0) (actual time=5583.706..5703.801 rows=6518 loops=1)"
"        ->  HashAggregate  (cost=76886.76..76886.82 rows=5 width=4) (actual time=5583.568..5594.977 rows=6518 loops=1)"
"              ->  Seq Scan on invoiceitems i  (cost=0.00..76885.98 rows=316 width=4) (actual time=0.295..5148.801 rows=6518 loops=1)"
"                    Filter: (name ~~* '%ac%'::text)"
"                    Rows Removed by Filter: 3275000"
"        ->  Index Only Scan using invoices_pkey on invoices o  (cost=0.00..6.56 rows=1 width=4) (actual time=0.006..0.008 rows=1 loops=6518)"
"              Index Cond: (id = i.invoice_id)"
"              Heap Fetches: 6518"
"Total runtime: 5713.804 ms"

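The semi-join seems to have no effect. Why?

(I do not think it matters, but I have dropped the original functional index on lower(invoiceitems.name).)

EDIT 2 ==============================================================

I would like to focus on the fetch rows query and provide a bit more context.

First, the user may request ordering by any field of the invoice (but not of the invoice items).

In addition, the user may provide a list of filter statements involving fields of both invoices and invoice items. These filter statements capture the semantics of filtering by a string or by a numeric value.

It is clearly not feasible to index every field. I will probably have to index only the most commonly used fields, such as the invoice item name and a few others.

Anyway, here are the indexes I have so far on the invoices and invoiceitems tables:

invoices

id as the primary key

invoiceitems

id as the primary key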
CREATE INDEX invoiceitems_invoice_id_idx ON invoiceitems USING btree (invoice_id);
CREATE INDEX invoiceitems_name_gin_trgm_idx ON invoiceitems USING gin (name COLLATE pg_catalog."default" gin_trgm_ops);

Here is the analysis of the fetch rows query using JOIN on the invoice items:

explain analyze
SELECT DISTINCT(o.id), t.terms, r.rep, s.ship_via, ...
FROM invoices o
JOIN invoiceitems items ON items.invoice_id = o.id
LEFT JOIN terms t ON t.id = o.term_id
LEFT JOIN reps r ON r.id = o.rep_id
LEFT JOIN shipVia s ON s.id = o.ship_via_id 
WHERE (items.name ILIKE '%df%' AND items.name IS NOT NULL) LIMIT 100;

"Limit  (cost=79100.70..79106.95 rows=100 width=312) (actual time=4637.195..4637.195 rows=0 loops=1)"
"  ->  Unique  (cost=79100.70..79120.45 rows=316 width=312) (actual time=4637.190..4637.190 rows=0 loops=1)"
"        ->  Sort  (cost=79100.70..79101.49 rows=316 width=312) (actual time=4637.186..4637.186 rows=0 loops=1)"
"              Sort Key: o.id, o.customer, o.business_no, o.bill_to_name, o.bill_to_address1, o.bill_to_address2, o.bill_to_postal_code, o.ship_to_name, o.ship_to_address1, o.ship_to_address2, o.ship_to_postal_code, o.purchase_order_no, t.terms, r.rep, ((o.ship_date)::text), s.ship_via, o.delivery, o.hst_percents, o.sub_total, o.total_before_hst, o.total, o.total_discount, o.hst, o.item_count"
"              Sort Method: quicksort  Memory: 25kB"
"              ->  Hash Left Join  (cost=113.02..79087.58 rows=316 width=312) (actual time=4637.179..4637.179 rows=0 loops=1)"
"                    Hash Cond: (o.ship_via_id = s.id)"
"                    ->  Hash Left Join  (cost=75.35..79043.98 rows=316 width=284) (actual time=4637.123..4637.123 rows=0 loops=1)"
"                          Hash Cond: (o.rep_id = r.id)"
"                          ->  Hash Left Join  (cost=37.67..79001.96 rows=316 width=256) (actual time=4637.119..4637.119 rows=0 loops=1)"
"                                Hash Cond: (o.term_id = t.id)"
"                                ->  Nested Loop  (cost=0.00..78960.73 rows=316 width=228) (actual time=4637.115..4637.115 rows=0 loops=1)"
"                                      ->  Seq Scan on invoiceitems items  (cost=0.00..76885.98 rows=316 width=4) (actual time=4637.108..4637.108 rows=0 loops=1)"
"                                            Filter: ((name IS NOT NULL) AND (name ~~* '%df%'::text))"
"                                            Rows Removed by Filter: 3281518"
"                                      ->  Index Scan using invoices_pkey on invoices o  (cost=0.00..6.56 rows=1 width=228) (never executed)"
"                                            Index Cond: (id = items.invoice_id)"
"                                ->  Hash  (cost=22.30..22.30 rows=1230 width=36) (never executed)"
"                                      ->  Seq Scan on terms t  (cost=0.00..22.30 rows=1230 width=36) (never executed)"
"                          ->  Hash  (cost=22.30..22.30 rows=1230 width=36) (never executed)"
"                                ->  Seq Scan on reps r  (cost=0.00..22.30 rows=1230 width=36) (never executed)"
"                    ->  Hash  (cost=22.30..22.30 rows=1230 width=36) (never executed)"
"                          ->  Seq Scan on shipvia s  (cost=0.00..22.30 rows=1230 width=36) (never executed)"
"Total runtime: 4637.731 ms"

Here is the analysis of the fetch rows query using WHERE EXISTS instead of JOIN on the invoice items:

explain analyze
SELECT o.id, t.terms, r.rep, s.ship_via, ...
FROM invoices o
LEFT JOIN terms t ON t.id = o.term_id
LEFT JOIN reps r ON r.id = o.rep_id
LEFT JOIN shipVia s ON s.id = o.ship_via_id 
WHERE EXISTS (
   SELECT 1
   FROM   invoiceitems i 
   WHERE  i.invoice_id = o.id
   AND    i.name ILIKE '%df%'
   AND    i.name IS NOT NULL
   ) LIMIT 100;

"Limit  (cost=0.19..43302.88 rows=100 width=610) (actual time=5771.852..5771.852 rows=0 loops=1)"
"  ->  Nested Loop Left Join  (cost=0.19..136836.68 rows=316 width=610) (actual time=5771.848..5771.848 rows=0 loops=1)"
"        ->  Nested Loop Left Join  (cost=0.19..135404.33 rows=316 width=582) (actual time=5771.844..5771.844 rows=0 loops=1)"
"              ->  Nested Loop Left Join  (cost=0.19..134052.55 rows=316 width=554) (actual time=5771.841..5771.841 rows=0 loops=1)"
"                    ->  Merge Semi Join  (cost=0.19..132700.78 rows=316 width=526) (actual time=5771.837..5771.837 rows=0 loops=1)"
"                          Merge Cond: (o.id = i.invoice_id)"
"                          ->  Index Scan using invoices_pkey on invoices o  (cost=0.00..3907.27 rows=65000 width=526) (actual time=0.017..0.017 rows=1 loops=1)"
"                          ->  Index Scan using invoiceitems_invoice_id_idx on invoiceitems i  (cost=0.00..129298.19 rows=316 width=4) (actual time=5771.812..5771.812 rows=0 loops=1)"
"                                Filter: ((name IS NOT NULL) AND (name ~~* '%df%'::text))"
"                                Rows Removed by Filter: 3281518"
"                    ->  Index Scan using terms_pkey on terms t  (cost=0.00..4.27 rows=1 width=36) (never executed)"
"                          Index Cond: (id = o.term_id)"
"              ->  Index Scan using reps_pkey on reps r  (cost=0.00..4.27 rows=1 width=36) (never executed)"
"                    Index Cond: (id = o.rep_id)"
"        ->  Index Scan using shipvia_pkey on shipvia s  (cost=0.00..4.27 rows=1 width=36) (never executed)"
"              Index Cond: (id = o.ship_via_id)"
"Total runtime: 5771.948 ms"

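I did not try the third option, ordering the invoiceitems rows by distinct invoice_id, because that approach seems viable only when no ordering is specified, whereas the opposite is usually the case: an ordering is present.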

score 1 · Accepted Answer

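Index

Trigram index

Use a trigram index provided by the pg_trgm module, which supplies operator classes for GIN or GiST indexes to support all LIKE (and ILIKE) patterns, not just left-anchored ones.

Find an overview of pattern matching and indexes in this related answer on dba.SE.
More details on how to use trigram indexes in this related answer (among many others):
PostgreSQL LIKE query performance variations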
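
To get a feel for what a trigram index works with, here is a rough Python sketch of trigram extraction. It is an approximation written for illustration, not the pg_trgm implementation: the function names are made up, and the padding rule (two leading spaces and one trailing space around each lowercased word) is my reading of the pg_trgm documentation.

```python
import re


def word_trigrams(text):
    """Approximate the trigram set derived from a stored string.

    Assumption: words are runs of letters and digits, lowercased, and each
    word is padded with two leading spaces and one trailing space.
    """
    trigrams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            trigrams.add(padded[i:i + 3])
    return trigrams


def pattern_trigrams(fragment):
    """Trigrams fully contained in an unanchored fragment ('%fragment%').

    Simplification: the fragment is assumed to consist of word characters
    only, so no padding applies on either side.
    """
    frag = fragment.lower()
    return {frag[i:i + 3] for i in range(len(frag) - 2)}


print(sorted(word_trigrams("cat")))
print(sorted(pattern_trigrams("acme")))
print(sorted(pattern_trigrams("ac")))
```

Under this approximation, a fragment like acme contributes the trigrams acm and cme, while a two-character fragment like ac contributes none. As far as I understand the pg_trgm documentation, the index has nothing to narrow the search with in that case, so it is worth testing with search terms of three or more characters as well.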

Example:

CREATE EXTENSION pg_trgm;  -- only once per db

CREATE INDEX invoiceitems_name_gist_trgm_idx
ON invoiceitems USING gist (name gist_trgm_ops);

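A GIN index is probably even faster, but bigger. I quote the manual:

As a rule of thumb, a GIN index is faster to search than a GiST index, but slower to build or update; so GIN is better suited for static data and GiST for often-updated data.

It all depends on your exact requirements.

Additional btree index

Of course, you also need a plain btree index (the default) on invoiceitems.invoice_id!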

CREATE INDEX invoiceitems_invoice_id_idx ON invoiceitems (invoice_id);

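Multicolumn index for index-only scans

In Postgres 9.2 or later, you might get an additional benefit from making this index "covering" for index-only scans. A GIN index normally makes no sense for an integer column like invoice_id. But to save an additional heap lookup, it may pay to include it in a multicolumn GIN (or GiST) index. You will have to test.

For this you need the additional module btree_gin (or btree_gist, respectively). Example for GIN: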

CREATE EXTENSION btree_gin;

CREATE INDEX invoiceitems_name_gin_trgm_idx
ON invoiceitems USING gin (name gin_trgm_ops, invoice_id);

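This would cover the need for the btree index above, but create that one anyway: on its own it makes fk-checks a lot faster, and it helps in many other cases as well.

Query

Counting

For the ...

query to get the number of invoices

... omit the additional tables, which can only do harm (if anything):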

SELECT COUNT(DISTINCT item.invoice_id)
FROM   invoiceitems item
JOIN   invoices o ON item.invoice_id = o.id
WHERE  item.name ILIKE '%ac%';

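Since your foreign key constraint guarantees referential integrity, you can even omit the invoices table from this query. The shiny new index should kick in!

(The updated form makes the EXISTS variant I suggested in my first draft unnecessary.)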
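
The claim that the invoices table can be dropped from the count is easy to check on toy data. The sketch below uses Python's sqlite3 with an in-memory database purely as a stand-in for PostgreSQL; the sample rows are invented, and SQLite's LIKE (case-insensitive for ASCII) stands in for ILIKE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the FK, as Postgres does
conn.executescript("""
    CREATE TABLE invoices (id INTEGER PRIMARY KEY);
    CREATE TABLE invoiceitems (
        id         INTEGER PRIMARY KEY,
        invoice_id INTEGER NOT NULL REFERENCES invoices (id),
        name       TEXT NOT NULL
    );
    -- Invented sample data: invoices 1 and 3 have items containing 'ac'.
    INSERT INTO invoices (id) VALUES (1), (2), (3);
    INSERT INTO invoiceitems (invoice_id, name) VALUES
        (1, 'Track pad'), (1, 'ACME bolt'), (2, 'Washer'), (3, 'Spacer');
""")

# Count with the join to invoices.
with_join = conn.execute("""
    SELECT COUNT(DISTINCT item.invoice_id)
    FROM   invoiceitems item
    JOIN   invoices o ON item.invoice_id = o.id
    WHERE  item.name LIKE '%ac%'
""").fetchone()[0]

# Count from invoiceitems alone; the FK guarantees every invoice_id exists.
items_only = conn.execute("""
    SELECT COUNT(DISTINCT invoice_id)
    FROM   invoiceitems
    WHERE  name LIKE '%ac%'
""").fetchone()[0]

print(with_join, items_only)
```

Both counts agree because every invoice_id in invoiceitems is guaranteed to have a matching parent row, so the join adds work without changing the result.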

Returning rows

For returning rows:

EXISTS would be good here, too:

SELECT t.terms, r.rep, s.ship_via, ...
FROM   invoices     o
LEFT   JOIN terms   t ON t.id = o.term_id
LEFT   JOIN reps    r ON r.id = o.rep_id
LEFT   JOIN shipVia s ON s.id = o.ship_via_id
WHERE EXISTS (
   SELECT 1
   FROM   invoiceitems i 
   WHERE  i.invoice_id = o.id
   AND    i.name ILIKE '%ac%'
   )
-- ORDER BY ???
LIMIT 100;
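
To see why EXISTS removes the need for DISTINCT, compare it with a plain join on toy data. This is only a sketch: it runs on SQLite through Python's sqlite3 rather than on PostgreSQL, the rows are invented, and LIKE plays the role of ILIKE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoices (id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE invoiceitems (
        id         INTEGER PRIMARY KEY,
        invoice_id INTEGER NOT NULL REFERENCES invoices (id),
        name       TEXT NOT NULL
    );
    -- Invented sample data: invoice 1 has two matching items.
    INSERT INTO invoices (id, customer) VALUES
        (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO invoiceitems (invoice_id, name) VALUES
        (1, 'Track pad'), (1, 'ACME bolt'), (2, 'Washer'), (3, 'Spacer');
""")

# A plain join returns the parent once per matching child row.
join_ids = [row[0] for row in conn.execute("""
    SELECT o.id
    FROM   invoices o
    JOIN   invoiceitems i ON i.invoice_id = o.id
    WHERE  i.name LIKE '%ac%'
    ORDER  BY o.id
""")]

# A semi-join returns each parent at most once, no DISTINCT needed.
exists_ids = [row[0] for row in conn.execute("""
    SELECT o.id
    FROM   invoices o
    WHERE  EXISTS (
       SELECT 1
       FROM   invoiceitems i
       WHERE  i.invoice_id = o.id
       AND    i.name LIKE '%ac%'
       )
    ORDER  BY o.id
""")]

print(join_ids, exists_ids)
```

The join yields invoice 1 twice, which is what forces DISTINCT (and the expensive sort over all selected columns) in the original query, while the EXISTS form yields each invoice once.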

Or you can test this variant, which joins to a subselect that picks the matching invoice_id values first. It may be even faster:

SELECT t.terms, r.rep, s.ship_via, ...
FROM  (
   SELECT DISTINCT invoice_id
   FROM   invoiceitems
   WHERE  name ILIKE '%ac%'
   ORDER  BY invoice_id           -- order by id = cheapest with above index
   LIMIT  100                     -- LIMIT early!
   ) item
JOIN   invoices     o ON o.id = item.invoice_id
LEFT   JOIN terms   t ON t.id = o.term_id
LEFT   JOIN reps    r ON r.id = o.rep_id
LEFT   JOIN shipVia s ON s.id = o.ship_via_id
-- ORDER BY ???
;

This example gets the first 100 by invoice_id (since you did not specify a sort order). It all depends on the details ...
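
The "LIMIT early" idea can be tried out on a small scale. The following sketch again uses SQLite via Python's sqlite3 as a stand-in for PostgreSQL, with invented rows, LIKE in place of ILIKE, and a page size of 2 instead of 100.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoices (id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE invoiceitems (
        id         INTEGER PRIMARY KEY,
        invoice_id INTEGER NOT NULL REFERENCES invoices (id),
        name       TEXT NOT NULL
    );
    -- Invented sample data: invoices 1, 3 and 4 have matching items.
    INSERT INTO invoices (id, customer) VALUES
        (1, 'Alice'), (2, 'Bob'), (3, 'Carol'), (4, 'Dave');
    INSERT INTO invoiceitems (invoice_id, name) VALUES
        (1, 'Track pad'), (1, 'ACME bolt'), (2, 'Washer'),
        (3, 'Spacer'), (4, 'Jack stand');
""")

# Reduce to one page of distinct invoice ids first, then join the
# parent table only for those few ids.
page = conn.execute("""
    SELECT o.id, o.customer
    FROM  (
       SELECT DISTINCT invoice_id
       FROM   invoiceitems
       WHERE  name LIKE '%ac%'
       ORDER  BY invoice_id
       LIMIT  2
       ) item
    JOIN   invoices o ON o.id = item.invoice_id
    ORDER  BY o.id
""").fetchall()

print(page)
```

Only the ids that make it onto the page are joined to invoices and the lookup tables. Note that this works as written only when the page order is by invoice_id; ordering by another invoice column would require that column inside the subselect's source, which changes the picture.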
