postgresql - Use postgres_fdw to speed up a view containing multiple self-joins with GROUP BY

Question

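(Apologies, and a warning: hack incoming...)

Background:

I have a legacy application and I would like to avoid rewriting much of its SQL code. I am trying to speed up one specific type of very costly query that it runs (i.e. the low-hanging fruit).

It has a financial transaction ledger represented by a transactions table. When a new row is inserted, a trigger function (not shown here) carries forward the new balance for the given entity.

Some types of transaction model externalities (such as in-flight payments) by tagging the new transaction with a "related" transaction, so that the application can group related transactions together.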

\d transactions

                  Table "public.transactions"
       Column        |   Type    | Modifiers 
---------------------+-----------+-----------
 entityid            | bigint    | not null
 transactionid       | bigint    | not null default nextval('tid_seq')
 type                | smallint  | not null
 status              | smallint  | not null
 related             | bigint    | 
 amount              | bigint    | not null
 abs_amount          | bigint    | not null
 is_credit           | boolean   | not null
 inserted            | timestamp | not null default now()
 description         | text      | not null
 balance             | bigint    | not null

Indexes:
    "transactions_pkey" PRIMARY KEY, btree (transactionid)
    "transactions by entityid" btree (entityid)
    "transactions by initial trans" btree ((COALESCE(related, transactionid)))

Foreign-key constraints:
    "invalid related transaction!" FOREIGN KEY (related) 
                                   REFERENCES transactions(transactionid)

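My test dataset has:

about 5.5 million rows in total
about 3.7 million rows with no "related" transaction
about 1.8 million rows with a "related" transaction
about 55,000 distinct entity IDs (customers).

So about 1/3 of all transaction rows are updates "related" to some earlier transaction. The production data is about 25 times larger in terms of transactionid and about 8 times larger in terms of distinct entityid, and the same 1/3 ratio of transaction updates applies.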

The code queries a particularly inefficient VIEW, defined as follows:

CREATE VIEW collapsed_transactions AS
SELECT t.entityid,
    g.initial,
    g.latest,
    i.inserted AS created,
    t.inserted AS updated,
    t.type,
    t.status,
    t.amount,
    t.abs_amount,
    t.is_credit,
    t.balance,
    t.description
FROM ( SELECT 
          COALESCE(x.related, x.transactionid) AS initial,
          max(x.transactionid) AS latest
       FROM transactions x
       GROUP BY COALESCE(x.related, x.transactionid)
     ) g
INNER JOIN transactions t ON t.transactionid = g.latest
INNER JOIN transactions i ON i.transactionid = g.initial;

A typical query takes the form:

SELECT * FROM collapsed_transactions WHERE entityid = 204425;

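As you can see, the where entityid = 204425 clause is not used to constrain the GROUP BY subquery, so the transactions of all entities are grouped. That gives a subquery result set some 55,000 times larger than necessary and a stupidly long query time... all to arrive at an average of 40 rows (71 in this example) at the time of writing.

I cannot normalise the transactions table further (for example, by having initial_transactions and updated_transactions tables joined by related) without rewriting hundreds of SQL queries in the codebase, many of which use the self-join semantics in different ways.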

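Insight:

I initially tried rewriting the query using WINDOW functions, but ran into all sorts of problems with that (another SO question for another time). When I saw that www_fdw passes its WHERE clause on to HTTP as GET/POST parameters, I was very intrigued by the possibility of very naive queries being optimised without much restructuring.

The PostgreSQL 9.3 manual says:

F.31.4. Remote Query Optimization

postgres_fdw attempts to optimize remote queries to reduce the amount of data transferred from foreign servers. This is done by sending query WHERE clauses to the remote server for execution, and by not retrieving table columns that are not needed for the current query. To reduce the risk of misexecution of queries, WHERE clauses are not sent to the remote server unless they use only built-in data types, operators, and functions. Operators and functions in the clauses must be IMMUTABLE as well.

The query that is actually sent to the remote server for execution can be examined using EXPLAIN VERBOSE.

Attempt:

So I thought that perhaps I could put the GROUP BY into a view, treat that view as a foreign table, and have the optimiser pass the WHERE clause through to that foreign table, resulting in a much more efficient query....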

CREATE VIEW foreign_transactions_grouped_by_initial_transaction AS 
  SELECT
    entityid,
    COALESCE(t.related, t.transactionid) AS initial,
    MAX(t.transactionid) AS latest
  FROM transactions t
  GROUP BY
    t.entityid,
    COALESCE(t.related, t.transactionid);

CREATE FOREIGN TABLE transactions_grouped_by_initial_transaction 
  (entityid bigint, initial bigint, latest bigint) 
  SERVER local_pg_server 
  OPTIONS (table_name 'foreign_transactions_grouped_by_initial_transaction');

EXPLAIN ANALYSE VERBOSE
  SELECT 
    t.entityid, 
    g.initial, 
    g.latest, 
    i.inserted AS created, 
    t.inserted AS updated, 
    t.type, 
    t.status,
    t.amount,
    t.abs_amount,
    t.is_credit,
    t.balance,
    t.description
  FROM transactions_grouped_by_initial_transaction g 
  INNER JOIN transactions t on t.transactionid = g.latest
  INNER JOIN transactions i on i.transactionid = g.initial 
  WHERE g.entityid = 204425;

and it works very nicely!

 Nested Loop  (cost=100.87..305.05 rows=10 width=116) 
              (actual time=4.113..16.646 rows=71 loops=1)
   Output: t.entityid, g.initial, g.latest, i.inserted, 
           t.inserted, t.type, t.status, t.amount, t.abs_amount, 
           t.balance, t.description
   ->  Nested Loop  (cost=100.43..220.42 rows=10 width=108) 
                    (actual time=4.017..10.725 rows=71 loops=1)
         Output: g.initial, g.latest, t.entityid, t.inserted, 
                 t.type, t.status, t.amount, t.abs_amount, t.is_credit,
                 t.balance, t.description
     ->  Foreign Scan on public.transactions_grouped_by_initial_transaction g
                 (cost=100.00..135.80 rows=10 width=16) 
                 (actual time=3.914..4.694 rows=71 loops=1)
            Output: g.entityid, g.initial, g.latest
            Remote SQL: 
              SELECT initial, latest
              FROM public.foreign_transactions_grouped_by_initial_transaction
              WHERE ((entityid = 204425))
         ->  Index Scan using transactions_pkey on public.transactions t  
                  (cost=0.43..8.45 rows=1 width=100) 
                  (actual time=0.023..0.035 rows=1 loops=71)
               Output: t.entityid, t.transactionid, t.type, t.status, 
                       t.related, t.amount, t.abs_amount, t.is_credit, 
                       t.inserted, t.description, t.balance
               Index Cond: (t.transactionid = g.latest)
   ->  Index Scan using transactions_pkey on public.transactions i  
            (cost=0.43..8.45 rows=1 width=16) 
            (actual time=0.021..0.033 rows=1 loops=71)
         Output: i.entityid, i.transactionid, i.type, i.status, 
                 i.related, i.amount, i.abs_amount, i.is_credit, 
                 i.inserted, i.description, i.balance
         Index Cond: (i.transactionid = g.initial)
 Total runtime: 20.363 ms
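
The foreign table above refers to a server named local_pg_server, whose definition is not shown in the question. A minimal sketch of what such a loopback setup might look like follows; the host, port, database name, role and password are placeholders for illustration, not values from the original post.

```sql
-- postgres_fdw ships as a contrib extension and must be installed per database.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- "Loopback" server: points back at the same cluster.
-- host, port and dbname are hypothetical values.
CREATE SERVER local_pg_server
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'localhost', port '5432', dbname 'ledger');

-- Credentials used for the remote side of the connection.
-- The role name and password are hypothetical.
CREATE USER MAPPING FOR CURRENT_USER
  SERVER local_pg_server
  OPTIONS (user 'ledger_app', password 'secret');
```

In the plan above, the line to look at is Remote SQL: the entityid = 204425 condition appears inside it, which means the filter was evaluated by the grouped view on the remote side rather than after the whole result had been fetched.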

Problem:

However, when I try to bake that into a VIEW (with or without another layer of postgres_fdw), the query optimiser does not seem to pass the WHERE clause through :-(

CREATE view collapsed_transactions_fast AS
  SELECT 
    t.entityid, 
    g.initial, 
    g.latest, 
    i.inserted AS created, 
    t.inserted AS updated, 
    t.type, 
    t.status,
    t.amount,
    t.abs_amount,
    t.is_credit,
    t.balance,
    t.description
  FROM transactions_grouped_by_initial_transaction g 
  INNER JOIN transactions t on t.transactionid = g.latest
  INNER JOIN transactions i on i.transactionid = g.initial;

EXPLAIN ANALYSE VERBOSE
  SELECT * FROM collapsed_transactions_fast WHERE entityid = 204425;

Result:

Nested Loop  (cost=534.97..621.88 rows=1 width=117) 
             (actual time=104720.383..139307.940 rows=71 loops=1)
  Output: t.entityid, g.initial, g.latest, i.inserted, t.inserted, t.type, 
          t.status, t.amount, t.abs_amount, t.is_credit, t.balance, 
          t.description
  ->  Hash Join  (cost=534.53..613.66 rows=1 width=109) 
                 (actual time=104720.308..139305.522 rows=71 loops=1)
        Output: g.initial, g.latest, t.entityid, t.inserted, t.type, 
                t.status, t.amount, t.abs_amount, t.is_credit, t.balance, 
                t.description
        Hash Cond: (g.latest = t.transactionid)
    ->  Foreign Scan on public.transactions_grouped_by_initial_transaction g
         (cost=100.00..171.44 rows=2048 width=16) 
         (actual time=23288.569..108916.051 rows=3705600 loops=1)
           Output: g.entityid, g.initial, g.latest
           Remote SQL: 
            SELECT initial, latest 
            FROM public.foreign_transactions_grouped_by_initial_transaction
        ->  Hash  (cost=432.76..432.76 rows=142 width=101) 
                  (actual time=2.103..2.103 rows=106 loops=1)
              Output: 
                t.entityid, t.inserted, t.type, t.status, t.amount, 
                t.abs_amount, t.is_credit, t.balance, t.description, 
                t.transactionid
              Buckets: 1024  Batches: 1  Memory Usage: 14kB
              ->  Index Scan using "transactions by entityid" 
                  on public.transactions t  
                     (cost=0.43..432.76 rows=142 width=101) 
                     (actual time=0.049..1.241 rows=106 loops=1)
                    Output: t.entityid, t.inserted, t.type, t.status, 
                            t.amount, t.abs_amount, t.is_credit, 
                            t.balance, t.description, t.transactionid
                    Index Cond: (t.entityid = 204425)
  ->  Index Scan using transactions_pkey on public.transactions i  
        (cost=0.43..8.20 rows=1 width=16) 
        (actual time=0.013..0.018 rows=1 loops=71)
        Output: i.entityid, i.transactionid, i.type, i.status, i.related, 
                i.amount, i.abs_amount, i.is_credit, i.inserted, i.description, 
                 i.balance
        Index Cond: (i.transactionid = g.initial)
Total runtime: 139575.140 ms
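
Comparing the two plans, one difference stands out: the ad-hoc query filtered on g.entityid, a column of the foreign table, whereas the view exposes t.entityid, so the condition ends up as an Index Cond on t and never appears in the Remote SQL. A hedged sketch of a variant that exposes the foreign table's column instead is shown below. The view name is made up for illustration, the definition has not been run against the asker's data, and it relies on entityid being the same for a transaction and its related rows.

```sql
-- Hypothetical variant of collapsed_transactions_fast: the only change is that
-- entityid is taken from g, so a filter on the view's entityid column becomes
-- a restriction on the foreign table itself.
CREATE VIEW collapsed_transactions_by_entity AS
  SELECT
    g.entityid,
    g.initial,
    g.latest,
    i.inserted AS created,
    t.inserted AS updated,
    t.type,
    t.status,
    t.amount,
    t.abs_amount,
    t.is_credit,
    t.balance,
    t.description
  FROM transactions_grouped_by_initial_transaction g
  INNER JOIN transactions t ON t.transactionid = g.latest
  INNER JOIN transactions i ON i.transactionid = g.initial;

-- Check what is actually shipped to the remote side.
EXPLAIN VERBOSE
  SELECT * FROM collapsed_transactions_by_entity WHERE entityid = 204425;
```

If the Remote SQL line in the resulting plan contains WHERE ((entityid = 204425)), the clause was sent to the remote server; if it does not, this variant did not help.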

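If that behaviour could be baked into a VIEW or an FDW, I could simply replace the name of the VIEW in a very small number of queries to make them much more efficient. I don't care if it is very slow for other use cases (more complex WHERE clauses), and I will name the VIEW to reflect its intended use.

use_remote_estimate is at its default value of FALSE, but it makes no difference either way.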
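
For reference, use_remote_estimate is a postgres_fdw option that can be set on the server or on an individual foreign table. A short sketch of toggling it on the foreign table from this question:

```sql
-- ADD applies when the option is not yet set on the table;
-- use SET instead of ADD to change an existing value.
ALTER FOREIGN TABLE transactions_grouped_by_initial_transaction
  OPTIONS (ADD use_remote_estimate 'true');

-- Remove the table-level override again, falling back to the server setting
-- (or to the default, false, if the server does not set it either).
ALTER FOREIGN TABLE transactions_grouped_by_initial_transaction
  OPTIONS (DROP use_remote_estimate);
```

The option only changes how costs are estimated (by running EXPLAIN remotely), so it is not expected to change which WHERE clauses are sent to the remote server.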

Question:

Is there some trickery I can use to make this admitted hack work?

score 2 · Accepted Answer

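If I have understood your question correctly, the answer is "no". There is no "trick" to get extra WHERE clauses passed through the fdw wrapper.

However, I think you are perhaps optimising the wrong thing.

I would replace the whole collapsed_transactions view. Unless I am missing something, it depends only on the transactions table. Create a table, keep it up to date using triggers, and grant only SELECT permission to the normal user. If you do not already have testing tools, get yourself some from pgtap.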
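
One possible shape for that trigger-maintained table is sketched below. It is an illustration under stated assumptions, not the answerer's code: the names transaction_groups, maintain_transaction_groups and collapsed_transactions_by_entity are invented, only the grouping is materialised rather than the whole view, the ledger is assumed to be insert-only, and entityid is assumed to be the same for a transaction and its related rows. The syntax is kept compatible with PostgreSQL 9.3 (no ON CONFLICT).

```sql
-- One row per group: the initial transaction and the latest one seen so far.
CREATE TABLE transaction_groups (
    initial  bigint PRIMARY KEY REFERENCES transactions (transactionid),
    entityid bigint NOT NULL,
    latest   bigint NOT NULL REFERENCES transactions (transactionid)
);
CREATE INDEX transaction_groups_by_entityid ON transaction_groups (entityid);

-- Backfill from existing data. Run this before creating the trigger, with
-- writes to transactions blocked, so that no row is missed or counted twice.
-- If entityid ever differs within a group, the primary key rejects the load.
INSERT INTO transaction_groups (initial, entityid, latest)
SELECT COALESCE(related, transactionid), entityid, max(transactionid)
FROM transactions
GROUP BY entityid, COALESCE(related, transactionid);

CREATE FUNCTION maintain_transaction_groups() RETURNS trigger AS $$
BEGIN
    -- A related transaction advances "latest" for its existing group...
    UPDATE transaction_groups
       SET latest = GREATEST(latest, NEW.transactionid)
     WHERE initial = COALESCE(NEW.related, NEW.transactionid);
    -- ...and a transaction with no group yet starts a new one.
    IF NOT FOUND THEN
        INSERT INTO transaction_groups (initial, entityid, latest)
        VALUES (COALESCE(NEW.related, NEW.transactionid),
                NEW.entityid, NEW.transactionid);
    END IF;
    RETURN NULL;  -- the return value of an AFTER row trigger is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER transactions_maintain_groups
AFTER INSERT ON transactions
FOR EACH ROW EXECUTE PROCEDURE maintain_transaction_groups();

-- Same columns as collapsed_transactions, driven by the small indexed table.
CREATE VIEW collapsed_transactions_by_entity AS
SELECT g.entityid, g.initial, g.latest,
       i.inserted AS created, t.inserted AS updated,
       t.type, t.status, t.amount, t.abs_amount,
       t.is_credit, t.balance, t.description
FROM transaction_groups g
INNER JOIN transactions t ON t.transactionid = g.latest
INNER JOIN transactions i ON i.transactionid = g.initial;
```

A query such as SELECT * FROM collapsed_transactions_by_entity WHERE entityid = 204425 can then start from the entityid index on transaction_groups instead of grouping the whole ledger. UPDATE and DELETE on transactions are not handled here and would need their own triggers if the ledger is not strictly append-only.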

Edit: optimising the view.

If you just want to optimise that one query against the view, and you can tweak the view definition, try this:

CREATE VIEW collapsed_transactions AS
SELECT
    g.entityid,  -- THIS HERE
    g.initial,
    g.latest,
    i.inserted AS created,
    t.inserted AS updated,
    t.type,
    t.status,
    t.amount,
    t.abs_amount,
    t.is_credit,
    t.balance,
    t.description
FROM (
    SELECT 
    entityid, -- THIS HERE
    COALESCE(x.related, x.transactionid) AS initial,
    max(x.transactionid) AS latest
    FROM transactions x
    GROUP BY entityid, COALESCE(x.related, x.transactionid)
) g
INNER JOIN transactions t ON t.transactionid = g.latest
INNER JOIN transactions i ON i.transactionid = g.initial;

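Note that the subquery now exposes the entity ID, which allows us to filter on it. I am assuming that the entity ID is the same for the main item and its related items; otherwise I do not see how the query would work. That should give the planner enough of a grasp of the problem to use the index on entity ID first and bring the query down to millisecond timings.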
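
With entityid among the grouping columns, a filter on it can be applied before the aggregation without changing the result. The effect is roughly as if the grouping had been written by hand as in the first statement below; this is a sketch for illustration, and the actual plan should be confirmed with the EXPLAIN that follows it.

```sql
-- Roughly what the planner can do once the filter is on a grouping column:
-- restrict to one entity first, then group only that entity's rows.
SELECT g.entityid, g.initial, g.latest
FROM (
    SELECT x.entityid,
           COALESCE(x.related, x.transactionid) AS initial,
           max(x.transactionid) AS latest
    FROM transactions x
    WHERE x.entityid = 204425      -- filter applied before grouping
    GROUP BY x.entityid, COALESCE(x.related, x.transactionid)
) g;

-- Confirm it against the redefined view: the scan of transactions x should
-- use the "transactions by entityid" index rather than reading every row.
EXPLAIN (ANALYZE, VERBOSE)
  SELECT * FROM collapsed_transactions WHERE entityid = 204425;
```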
