Here is the test demonstrating performance of the three valid answers.
EXISTS
is superior to the one with LEFT JOIN
/ GROUP BY
:
Test setup
Table with 100k rows, 1000 different values for b
.
The performance gap widens with more rows - fewer duplicates means less difference.
No indexes.
CREATE TABLE tbl (a text, b text);
INSERT INTO tbl
SELECT (random()*10000)::int::text
,(random()*1000)::int || ' some more text here'
FROM generate_series(1, 100000) g;
1. @Guffa: LEFT JOIN
/ GROUP BY
/ HAVING
EXPLAIN ANALYZE
SELECT t.a, t.b
FROM tbl t
LEFT join tbl t2 on t2.b = t.b and t2.a <> t.a
GROUP by t.a, t.b
HAVING count(t2.a) >= 1;
2. The same, untangled to just JOIN
/ GROUP BY
EXPLAIN ANALYZE
SELECT t.a, t.b
FROM tbl t
JOIN tbl t2 ON t2.b = t.b AND t2.a <> t.a
GROUP BY t.a, t.b;
EXPLAIN ANALYZE
SELECT *
FROM tbl t
WHERE EXISTS (
SELECT *
FROM tbl t2
WHERE t2.a <> t.a
AND t2.b = t.b
);
EXPLAIN ANALYZE
SELECT DISTINCT t.a, t.b
FROM tbl t
JOIN tbl t2 on t2.b = t.b and t2.a <> t.a;
-> SQLfiddle displaying EXPLAIN ANALYZE output for the queries.
- Total runtime: 12208.954 ms
- Total runtime: 11504.460 ms
- Total runtime: 272.508 ms -- ! ~ 45x faster than 1.
- Total runtime: 11540.627 ms
After adding multi-column index (SQLfiddle) ..
CREATE INDEX a_b_idx ON tbl(b, a);
.. the runtime doesn't change. Postgres doesn't use the index. It obviously expects a sequential table-scan to be faster since the whole table has to be read anyway.
Besides the execution time, also notice the row count, proving my point as discussed:
The JOIN creates a lot of intermediate duplicates, which the EXISTS
version avoids to begin with:
Output of EXPLAIN ANALYZE
for 1.:
HashAggregate (cost=230601.26..230726.26 rows=10000 width=31) (actual time=12127.090..12183.087 rows=99476 loops=1)
Filter: (count(t2.a) >= 1)
-> Hash Left Join (cost=3670.00..154661.89 rows=10125250 width=31) (actual time=99.591..5897.744 rows=9991102 loops=1)
Hash Cond: (t.b = t2.b)
Join Filter: (t2.a t.a)
Rows Removed by Join Filter: 101052
-> Seq Scan on tbl t (cost=0.00..1736.00 rows=100000 width=27) (actual time=0.036..36.197 rows=100000 loops=1)
-> Hash (cost=1736.00..1736.00 rows=100000 width=27) (actual time=99.141..99.141 rows=100000 loops=1)
Buckets: 2048 Batches: 8 Memory Usage: 784kB
-> Seq Scan on tbl t2 (cost=0.00..1736.00 rows=100000 width=27) (actual time=0.004..44.899 rows=100000 loops=1)
Total runtime: 12208.954 ms
Output of EXPLAIN ANALYZE
for 3.:
Hash Semi Join (cost=3670.00..7783.00 rows=1 width=27) (actual time=81.630..247.371 rows=100000 loops=1)
Hash Cond: (t.b = t2.b)
Join Filter: (t2.a t.a)
Rows Removed by Join Filter: 1009
-> Seq Scan on tbl t (cost=0.00..1736.00 rows=100000 width=27) (actual time=0.010..32.758 rows=100000 loops=1)
-> Hash (cost=1736.00..1736.00 rows=100000 width=27) (actual time=81.388..81.388 rows=100000 loops=1)
Buckets: 2048 Batches: 8 Memory Usage: 784kB
-> Seq Scan on tbl t2 (cost=0.00..1736.00 rows=100000 width=27) (actual time=0.003..32.114 rows=100000 loops=1)
Total runtime: 272.508 ms