regex - Update query too slow on Postgres 9.1

Question

My problem is a very slow update query on a table with 14 million rows. I tried different things to tune my server, which improved performance in general, but not for update queries.

I have two tables:

T1 (530 rows) with 4 columns and 3 indexes
T2 (14 million rows) with 15 columns and 3 indexes
I want to update the field vid (type integer) in T2 with the value of vid in T1, joining the two tables on a text field stxt.

Here is my query and its output:

explain analyse 
update T2 
  set vid=T1.vid 
from T1 
where stxt2 ~ stxt1 and T2.vid = 0;

Update on T2  (cost=0.00..9037530.59 rows=2814247 width=131) (actual time=25141785.741..25141785.741 rows=0 loops=1)
 ->  Nested Loop  (cost=0.00..9037530.59 rows=2814247 width=131) (actual time=32.636..25035782.995 rows=679354 loops=1)
             Join Filter: ((T2.stxt2)::text ~ (T1.stxt1)::text)
             ->  Seq Scan on T2  (cost=0.00..594772.96 rows=1061980 width=121) (actual time=0.067..5402.614 rows=1037809 loops=1)
                         Filter: (vid= 1)
             ->  Materialize  (cost=0.00..17.95 rows=530 width=34) (actual time=0.000..0.069 rows=530 loops=1037809)
                         ->  Seq Scan on T1  (cost=0.00..15.30 rows=530 width=34) (actual time=0.019..0.397 rows=530 loops=1)
Total runtime: 25141785.904 ms

As you can see, the query took approximately 25141 seconds (~7 hours). If I understand correctly, the planner estimated the execution time at 9037 seconds (~2.5 hours). Am I missing something here?

Here is some information about my server configuration:

CentOS 5.8, 20 GB of RAM
shared_buffers = 12GB
work_mem = 64MB
maintenance_work_mem = 64MB
bgwriter_lru_maxpages = 500
checkpoint_segments = 64
checkpoint_completion_target = 0.9
effective_cache_size = 10GB

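I have run VACUUM FULL and ANALYZE several times on table T2, but this still does not improve the situation much.

PS: if I set full_page_writes to off, update queries improve considerably, but I don't want to risk data loss. Do you have any recommendations?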

score 3 · Accepted Answer

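This is not a solution, but a data-modelling workaround.

Break up the URLs into {protocol, hostname, pathname} components.
Now you can join on the hostname part using exact matches, avoiding the leading % in the regex match.
The view is intended to demonstrate that the full_url can be reconstructed if needed.

The update could probably take a few minutes.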

SET search_path='tmp';

DROP TABLE IF EXISTS urls CASCADE;
CREATE TABLE urls
        ( id SERIAL NOT NULL PRIMARY KEY
        , full_url varchar
        , proto varchar
        , hostname varchar
        , pathname varchar
        );

INSERT INTO urls(full_url) VALUES
 ( 'ftp://www.myhost.com/secret.tgz' )
,( 'http://www.myhost.com/robots.txt' )
,( 'http://www.myhost.com/index.php' )
,( 'https://www.myhost.com/index.php' )
,( 'http://www.myhost.com/subdir/index.php' )
,( 'https://www.myhost.com/subdir/index.php' )
,( 'http://www.hishost.com/index.php' )
,( 'https://www.hishost.com/index.php' )
,( 'http://www.herhost.com/index.php' )
,( 'https://www.herhost.com/index.php' )
        ;

UPDATE urls
SET proto = split_part(full_url, '://' , 1)
        , hostname = split_part(full_url, '://' , 2)
        ;

UPDATE urls
SET pathname = substr(hostname, 1+strpos(hostname, '/' ))
        , hostname = split_part(hostname, '/' , 1)
        ;

        -- the full_url field is now redundant: we can drop it
ALTER TABLE urls
        DROP column full_url
        ;
        -- and we could always reconstruct the full_url from its components.
CREATE VIEW vurls AS (
        SELECT id
        , proto || '://' || hostname || '/' || pathname AS full_url
        , proto
        , hostname
        , pathname
        FROM urls
        );

SELECT * FROM urls;
SELECT * FROM vurls;

Output:

INSERT 0 10
UPDATE 10
UPDATE 10
ALTER TABLE
CREATE VIEW
 id | proto |    hostname     |     pathname     
----+-------+-----------------+------------------
  1 | ftp   | www.myhost.com  | secret.tgz
  2 | http  | www.myhost.com  | robots.txt
  3 | http  | www.myhost.com  | index.php
  4 | https | www.myhost.com  | index.php
  5 | http  | www.myhost.com  | subdir/index.php
  6 | https | www.myhost.com  | subdir/index.php
  7 | http  | www.hishost.com | index.php
  8 | https | www.hishost.com | index.php
  9 | http  | www.herhost.com | index.php
 10 | https | www.herhost.com | index.php
(10 rows)

 id |                full_url                 | proto |    hostname     |     pathname     
----+-----------------------------------------+-------+-----------------+------------------
  1 | ftp://www.myhost.com/secret.tgz         | ftp   | www.myhost.com  | secret.tgz
  2 | http://www.myhost.com/robots.txt        | http  | www.myhost.com  | robots.txt
  3 | http://www.myhost.com/index.php         | http  | www.myhost.com  | index.php
  4 | https://www.myhost.com/index.php        | https | www.myhost.com  | index.php
  5 | http://www.myhost.com/subdir/index.php  | http  | www.myhost.com  | subdir/index.php
  6 | https://www.myhost.com/subdir/index.php | https | www.myhost.com  | subdir/index.php
  7 | http://www.hishost.com/index.php        | http  | www.hishost.com | index.php
  8 | https://www.hishost.com/index.php       | https | www.hishost.com | index.php
  9 | http://www.herhost.com/index.php        | http  | www.herhost.com | index.php
 10 | https://www.herhost.com/index.php       | https | www.herhost.com | index.php
(10 rows)
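
With the hostname in its own column, matching against T1 becomes a plain equality join. The following is only a minimal sketch: it assumes that T1.stxt1 holds bare hostnames such as 'www.myhost.com' (the answer does not show T1's contents), and it uses a SELECT so that no extra column is needed on urls.

```sql
-- Assumption: T1.stxt1 contains bare hostnames, T1.vid the value to assign.
-- An equality condition like this can be executed as a hash join,
-- whereas a regex condition has to be evaluated for every pair of rows.
SELECT u.id
     , u.hostname
     , T1.vid
FROM urls u
JOIN T1 ON u.hostname = T1.stxt1;
```

The same join condition could then drive an UPDATE ... FROM once the target table has a vid column.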

score 0 · Accepted Answer

Thanks, this helps. So here is what I did:

I created the table urls as you mentioned
I added a vid column of type integer
I inserted 1000000 rows from T2 into the full_url column
I enabled timing and updated the hostname column for the full_url values that contain neither 'http' nor 'www': update urls set hostname=full_url where full_url not like '%/%' and full_url not like 'www\.%';

Time: 112435.192 ms
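
The setup steps listed above could look roughly like this. It is a sketch under assumptions, not the exact statements that were run: it assumes urls still has its full_url column (i.e. it was not dropped) and that the URLs were copied from T2.stxt2.

```sql
-- Assumption: urls was created with full_url, proto, hostname, pathname
-- and full_url has NOT been dropped.
ALTER TABLE urls ADD COLUMN vid integer;

-- Assumption: the sample is taken from T2.stxt2; without an ORDER BY,
-- LIMIT returns an arbitrary 1,000,000 rows.
INSERT INTO urls (full_url)
SELECT stxt2
FROM T2
LIMIT 1000000;

-- In psql, the meta-command \timing toggles the "Time: ... ms" line
-- printed after each statement.
```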

Then I ran the following query:

    mydb=> explain analyse update urls set vid=T1.vid from T1 where hostname=stxt1; 
             QUERY PLAN                                                          
            -----------------------------------------------------------------------------------------------------------------------------
             Update on urls  (cost=21.93..37758.76 rows=864449 width=124) (actual time=767.793..767.793 rows=0 loops=1)
                 ->  Hash Join  (cost=21.93..37758.76 rows=864449 width=124) (actual time=102.324..430.448 rows=94934 loops=1)
                             Hash Cond: ((urls.hostname)::text = (T1.stxt1)::text)
                             ->  Seq Scan on urls  (cost=0.00..25612.52 rows=927952 width=114) (actual time=0.009..265.962 rows=927952 loops=1)
                             ->  Hash  (cost=15.30..15.30 rows=530 width=34) (actual time=0.444..0.444 rows=530 loops=1)
                                         Buckets: 1024  Batches: 1  Memory Usage: 35kB
                                         ->  Seq Scan on T1  (cost=0.00..15.30 rows=530 width=34) (actual time=0.002..0.181 rows=530 loops=1)
             Total runtime: 767.860 ms

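I was really surprised by the total runtime! Less than 1 second, which confirms what you said about updates with exact matches. Next, I searched for exact matches between stxt1 and stxt2 this way: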

mydb=> select count(*) from T2 where vid is null and exists(select null from T1 where stxt1=stxt2);
 count  
--------
 308486
(1 row)

So I tried the update on the T2 table and got the following:

mydb=> explain analyse update T2 set vid = T1.vid from T1 where T2.vid is null and stxt2=stxt1;
                                                                                                                            QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Update on T2  (cost=21.93..492023.13 rows=2106020 width=131) (actual time=252395.118..252395.118 rows=0 loops=1)
     ->  Hash Join  (cost=21.93..492023.13 rows=2106020 width=131) (actual time=1207.897..4739.515 rows=308486 loops=1)
                 Hash Cond: ((T2.stxt2)::text = (T1.stxt1)::text)
                 ->  Seq Scan on T2  (cost=0.00..455452.09 rows=4130377 width=121) (actual time=158.773..3915.379 rows=4103865 loops=1)
                             Filter: (vid IS NULL)
                 ->  Hash  (cost=15.30..15.30 rows=530 width=34) (actual time=0.293..0.293 rows=530 loops=1)
                             Buckets: 1024  Batches: 1  Memory Usage: 35kB
                             ->  Seq Scan on T1  (cost=0.00..15.30 rows=530 width=34) (actual time=0.005..0.121 rows=530 loops=1)
 Total runtime: 252395.204 ms
(9 rows)

Time: 255389.704 ms

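Indeed, 255 seconds seems very good for such a query. I will try to extract the hostname part from all the URLs and then run the update. I still have to make sure that updating with exact matches is fast, because I have had bad experiences with it.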
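
For the hostname extraction, something along these lines might work. It is a rough sketch only: the hostname column on T2 is hypothetical, and it assumes stxt2 holds URLs with or without a leading "scheme://" prefix.

```sql
-- Assumption: a new, hypothetical column to hold the extracted hostname.
ALTER TABLE T2 ADD COLUMN hostname varchar;

-- Strip an optional "scheme://" prefix, then keep everything before the
-- first '/'. A value without any '/' is kept whole.
-- Note: this rewrites every row of T2, so it is itself a heavy update.
UPDATE T2
SET hostname = split_part(
        regexp_replace(stxt2, '^[A-Za-z][A-Za-z0-9+.-]*://', '')
      , '/'
      , 1
      );
```

After that, the exact-match update can compare T2.hostname with T1.stxt1 instead of comparing stxt2 directly.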

Thanks for your support.
