sql - Futher optimization of Postgres Full-text search on a Cloud Server

Question

I am running Postgres 9.1 on a cloud server (which I'm aware is far from ideal, we are hoping to migrate at some point this year). This server frequently performs full text queries on a table with more than 3 million records. Here is an example of a typical query.

SELECT id 
FROM Table 
WHERE datepublished BETWEEN $$2012-05-01 00:00:00$$ AND $$2013-05-15 23:59:59$$ 
  AND hide = false 
  AND ( tsvall @@ to_tsquery('query')) 
ORDER BY datepublished DESC

The columns datepublished, hide and tsvall are all indexed , tsvall is indexed using GIN. The postgres configuration settings shared_buffers, effective_cache_size, work_mem have also been tweaked.

For an a typical query (query text used was "august") using the parameters in the example above returning 986 rows took 5 seconds. I really would like to get to speed this up. Your help would be greatly appreciated, I can provide extra information on request.

--EDIT : EXPLAIN ANALYSE RESULTS

 Sort  (cost=15352.87..15355.18 rows=927 width=16) (actual time=17705.293..17706.266 rows=849 loops=1)
   Sort Key: datepublished
   Sort Method:  quicksort  Memory: 64kB
   ->  Bitmap Heap Scan on post  (cost=1049.44..15307.18 rows=927 width=16) (actual time=63.520..17702.219 rows=849 loops=1)
         Recheck Cond: (tsvall @@ to_tsquery('trialing'::text))
         Filter: ((at IS NULL) AND (NOT hide) AND (datepublished >= '2012-05-04 00:00:00'::timestamp without time zone) AND (datepublished <= '2013-06-04 23:59:59'::timestamp without time zone))
         ->  Bitmap Index Scan on index_tsvall  (cost=0.00..1049.20 rows=3758 width=0) (actual time=62.537..62.537 rows=4814 loops=1)
               Index Cond: (tsvall @@ to_tsquery('trialing'::text))
 Total runtime: 17707.280 ms

here is a link as requested in the comments http://explain.depesz.com/s/QDAb

--EDIT 2

I realised that my datepublished index was not orderd so I have created a (btree) index on the column which is ordered DESC. Here is an example of an EXPLAIN ANALYSE output I am now getting

   ->  Bitmap Heap Scan on post  (cost=65485.44..82297.13 rows=4441 width=16) (actual time=1397.734..7775.204 rows=3161 loops=1)
         Recheck Cond: ((tsvall @@ to_tsquery('debate'::text)) AND (datepublished >= '2013-04-01 00:00:00'::timestamp without time zone) AND (datepublished <= '2013-06-04 23:59:59'::timestamp without time zone) AND (at IS NULL))
         Filter: (NOT hide)
         ->  BitmapAnd  (cost=65485.44..65485.44 rows=4456 width=0) (actual time=1396.544..1396.544 rows=0 loops=1)
               ->  Bitmap Index Scan on index_tsvall  (cost=0.00..13526.88 rows=67979 width=0) (actual time=531.941..531.941 rows=71502 loops=1)
                     Index Cond: (tsvall @@ to_tsquery('debate'::text))
               ->  Bitmap Index Scan on datepublished_index  (cost=0.00..23142.92 rows=1103417 width=0) (actual time=382.808..382.808 rows=1164707 loops=1)
                     Index Cond: ((datepublished >= '2013-04-01 00:00:00'::timestamp without time zone) AND (datepublished <= '2013-06-04 23:59:59'::timestamp without time zone))
               ->  Bitmap Index Scan on index_at  (cost=0.00..28811.80 rows=1253179 width=0) (actual time=422.077..422.077 rows=1319617 loops=1)
                     Index Cond: (at IS NULL)

And here again is the link as requested http://explain.depesz.com/s/Ksss

Mark

score 2 · Accepted Answer

統計の見積もりは合理的であり、非常に単純なクエリプランです。それ自体が問題の一部です。

(非常に) 高価なノードは、ビットマップヒープスキャンです。

Bitmap Heap Scan on post (cost=1049.44..15307.18 rows=927 width=16) (actual time=63.520..17702.219 rows=849 loops=1)
  Recheck Cond: (tsvall @@ to_tsquery('trialing'::text))
  Filter: ((at IS NULL) AND (NOT hide) AND (datepublished >= '2012-05-04 00:00:00'::timestamp without time zone) AND (datepublished <= '2013-06-04 23:59:59'::timestamp without time zone))

大きくて複雑なフィルター句が見えますか? これは、Pg がこれらの条件を満たすために他のインデックスを使用していないことを意味するという点で、少し奇妙です。

random_page_costPgが他のインデックスを使用する原因を減らしていますか?

最悪の場合、明示的な CTE を使用して Pg に最初に他のフィルターを実行させ、次に結果に tsquery を適用することができる場合があります。これには少しマテリアライズが必要になるため、次のような理想的ではありません。

WITH prefiltered AS (
  SELECT id, tsvall, datepublished
  FROM Table 
  WHERE datepublished BETWEEN '2012-05-01 00:00:00' AND '2013-05-15 23:59:59' 
    AND hide = false 
)
SELECT
FROM prefiltered
WHERE tsvall @@ to_tsquery('query')
ORDER BY datepublished DESC;

または、Denis が賢明に指摘したように、次のような複合 btree-gist インデックスを作成してみてください。

CREATE INDEX posts_blah ON posts USING gist(datepublished, tsvall) WHERE (hide = false);

...ただし、このインデックスのサイズが大きいことと、その更新コストが問題になる可能性があります。

sql - Futher optimization of Postgres Full-text search on a Cloud Server

2 に答える 2

Related

Reference