sql - グループ化 > 1 のすべての結果を選択

Question

住所のあるアカウントでいっぱいのテーブルがあります。別のアカウントと同じアドレスに存在するすべてのアカウントを選択したいと考えています。

私のデータが次のように見える場合：

------------------------------------
| Account Number | Address          |
| 12345          | 55 Bee St        |
| 23456          | 94 Water way     |
| 34567          | 15 Beagle Drive  |
| 45678          | 55 Bee St        |
| 56789          | 94 Water way     |
| 67890          | 12 Green St      |
-------------------------------------

私は次の行に沿って何かをしたいと思います:

SELECT * FROM accounts WHERE group by address > 1;

私の結果は次のようになります。

------------------------------------
| Account Number | Address          |
| 12345          | 55 Bee St        |
| 23456          | 94 Water way     |
| 45678          | 55 Bee St        |
| 56789          | 94 Water way     |
-------------------------------------

違いがあれば、それは PostgreSQL データベースです。

score 1 · Accepted Answer

同じテーブルに対して左結合を作成して、同じ住所を持つレコードを検索し、フィールドでグループ化すると、一致する住所を数えて、少なくとも 1 つの一致する住所を持つレコードを取得できます。

select a.AccountNumber, a.Address
from accounts a
left join accounts o on o.Address = a.Address and o.AccountNumber <> a.AccountNumber
group by a.AccountNumber, a.Address
having count(o.AccountNumber) >= 1

このアプローチでは、各口座番号の住所が得られ、住所が 2 回以上発生しても重複はありません。

score 1 · Accepted Answer

2 つのアドレスが同じであるが、アカウント番号が 2 つの行間で異なることを確認する結合条件を使用して、テーブルをそれ自体に結合する必要があります。

select distinct account_number, address
from accounts a1
join accounts a2 on a1.account_number > a2.account_number
   and a1.address = a2.address

>行がそれ自体に結合するのを防ぐだけでなく、逆結合も防ぐアカウント番号間の比較の使用に注意してください。

同じアドレスのアカウントが 3 つある場合に備えて追加distinctしました。それ以外の場合は必要ありません。

score 1 · Accepted Answer

これでうまくいくはずです：

SELECT *
FROM Account A1
WHERE
    EXISTS (
        SELECT *
        FROM Account A2
        WHERE
            A1.AccountNumber <> A2.AccountNumber
            AND A1.Address = A2.Address
    )

A1.AccountNumber <> A2.AccountNumber平易な英語で:同じアドレス ( ) を持つ別のアカウント ( ) があるように、すべてのアカウントを選択しますA1.Address = A2.Address。

score 1 · Accepted Answer

Here is the test demonstrating performance of the three valid answers.
EXISTS is superior to the one with LEFT JOIN / GROUP BY:

Test setup

Table with 100k rows, 1000 different values for b.
The performance gap widens with more rows - fewer duplicates means less difference.
No indexes.

CREATE TABLE tbl (a text, b text);
INSERT INTO tbl
SELECT (random()*10000)::int::text
      ,(random()*1000)::int || ' some more text here'
FROM   generate_series(1, 100000) g;

1. @Guffa: `LEFT JOIN` / `GROUP BY` / `HAVING`

EXPLAIN ANALYZE
SELECT t.a, t.b
FROM   tbl t
LEFT   join tbl t2 on t2.b = t.b and t2.a <> t.a
GROUP  by t.a, t.b
HAVING count(t2.a) >= 1;

2. The same, untangled to just `JOIN` / `GROUP BY`

EXPLAIN ANALYZE
SELECT t.a, t.b
FROM   tbl t
JOIN   tbl t2 ON t2.b = t.b AND t2.a <> t.a
GROUP  BY t.a, t.b;

3.@Branko: `EXISTS`

EXPLAIN ANALYZE
SELECT *
FROM   tbl t
WHERE  EXISTS (
    SELECT *
    FROM   tbl t2
    WHERE  t2.a <> t.a
    AND    t2.b = t.b
    );

4.@Bohemian: `DISTINCT`

EXPLAIN ANALYZE
SELECT DISTINCT t.a, t.b
FROM   tbl t
JOIN   tbl t2 on t2.b = t.b and t2.a <> t.a;

-> SQLfiddle displaying EXPLAIN ANALYZE output for the queries.

Total runtime: 12208.954 ms
Total runtime: 11504.460 ms
Total runtime: 272.508 ms -- ! ~ 45x faster than 1.
Total runtime: 11540.627 ms

After adding multi-column index (SQLfiddle) ..

CREATE INDEX a_b_idx ON tbl(b, a);

.. the runtime doesn't change. Postgres doesn't use the index. It obviously expects a sequential table-scan to be faster since the whole table has to be read anyway.

Besides the execution time, also notice the row count, proving my point as discussed:
The JOIN creates a lot of intermediate duplicates, which the EXISTS version avoids to begin with:

Output of EXPLAIN ANALYZE for 1.:

HashAggregate (cost=230601.26..230726.26 rows=10000 width=31) (actual time=12127.090..12183.087 rows=99476 loops=1)
Filter: (count(t2.a) >= 1)
-> Hash Left Join (cost=3670.00..154661.89 rows=10125250 width=31) (actual time=99.591..5897.744 rows=9991102 loops=1)
Hash Cond: (t.b = t2.b)
Join Filter: (t2.a  t.a)
Rows Removed by Join Filter: 101052
-> Seq Scan on tbl t (cost=0.00..1736.00 rows=100000 width=27) (actual time=0.036..36.197 rows=100000 loops=1)
-> Hash (cost=1736.00..1736.00 rows=100000 width=27) (actual time=99.141..99.141 rows=100000 loops=1)
Buckets: 2048 Batches: 8 Memory Usage: 784kB
-> Seq Scan on tbl t2 (cost=0.00..1736.00 rows=100000 width=27) (actual time=0.004..44.899 rows=100000 loops=1)
Total runtime: 12208.954 ms

Output of EXPLAIN ANALYZE for 3.:

Hash Semi Join (cost=3670.00..7783.00 rows=1 width=27) (actual time=81.630..247.371 rows=100000 loops=1)
Hash Cond: (t.b = t2.b)
Join Filter: (t2.a  t.a)
Rows Removed by Join Filter: 1009
-> Seq Scan on tbl t (cost=0.00..1736.00 rows=100000 width=27) (actual time=0.010..32.758 rows=100000 loops=1)
-> Hash (cost=1736.00..1736.00 rows=100000 width=27) (actual time=81.388..81.388 rows=100000 loops=1)
Buckets: 2048 Batches: 8 Memory Usage: 784kB
-> Seq Scan on tbl t2 (cost=0.00..1736.00 rows=100000 width=27) (actual time=0.003..32.114 rows=100000 loops=1)
Total runtime: 272.508 ms

score 0 · Accepted Answer

0

HAVING句が必要です：

SELECT * FROM accounts
GROUP BY address
HAVING COUNT(address) > 1;

于 2013-03-28T01:56:48.737 に答える

score 0 · Accepted Answer

HAVING句を探していると思います：

 select address,sum(accountnumber) group by address having sum(accountnumber) >1

sql - グループ化 > 1 のすべての結果を選択

6 に答える 6

Test setup

1. @Guffa: LEFT JOIN / GROUP BY / HAVING

2. The same, untangled to just JOIN / GROUP BY

3.@Branko: EXISTS

4.@Bohemian: DISTINCT

Related

Reference

1. @Guffa: `LEFT JOIN` / `GROUP BY` / `HAVING`

2. The same, untangled to just `JOIN` / `GROUP BY`

3.@Branko: `EXISTS`

4.@Bohemian: `DISTINCT`