vertica - Deleting duplicate rows in a Vertica database

Question

Vertica allows duplicates to be inserted into tables. I can view them with the `analyze_constraints` function. How do I delete duplicate rows from a Vertica table?

score 2 · Accepted Answer

This is off the top of my head and not a great answer, so don't treat it as the final word: you can delete both copies and insert one back.
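A minimal sketch of the delete-and-reinsert idea, assuming a hypothetical table mytable(id, val) in which the row with id = 42 is stored twice as exact copies. The table name, the columns, and the scratch table mytable_keep are illustrative and do not come from the question.

```sql
-- Hypothetical table: mytable(id INT, val VARCHAR); id = 42 is stored twice.

-- 1. Save a single copy of the duplicated row in a session-scoped temp table.
CREATE LOCAL TEMPORARY TABLE mytable_keep ON COMMIT PRESERVE ROWS AS
SELECT DISTINCT * FROM mytable WHERE id = 42;

-- 2. Delete every copy from the original table.
DELETE FROM mytable WHERE id = 42;

-- 3. Insert the saved copy back and make the change permanent.
INSERT INTO mytable SELECT * FROM mytable_keep;
COMMIT;
```

SELECT DISTINCT only collapses rows that match in every column. Rows that share a key but differ elsewhere need an explicit rule for choosing the survivor.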

score 2 · Accepted Answer

Mauro's answer is correct, but there is an error in the SQL in step 2. So the complete way to work around DELETE looks like this:

Step 1: Create a new table with the same structure/projections as the table that contains the duplicates.

create table mytable_new like mytable including projections ;

Step 2: Insert the deduplicated rows into this new table.

insert /*+ direct */ into mytable_new select <column list> from (
            select * , row_number() over ( partition by <pk column list> ) as rownum from mytable
    ) a where a.rownum = 1 ;

Step 3: Rename the original table (the one containing the dups).

alter table mytable rename to mytable_orig ;

Step 4: Rename the new table.

alter table mytable_new rename to mytable ;
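Before discarding the renamed original, it may be worth checking the result. The sketch below reuses the <pk column list> placeholder and the table names from the steps above. The verification queries and the final drop are an assumed follow-up, not part of the original answer.

```sql
-- Should return no rows if every key is now unique.
select <pk column list>, count(*)
from mytable
group by <pk column list>
having count(*) > 1 ;

-- Compare row counts; the difference is the number of duplicates dropped.
select count(*) from mytable_orig ;
select count(*) from mytable ;

-- Once satisfied, drop the old copy. CASCADE also drops its projections.
drop table mytable_orig cascade ;
```

Because the row_number() window in step 2 has no ORDER BY, which copy of each key survives is arbitrary. That is harmless for exact duplicates, but add an ORDER BY inside the OVER clause if the copies differ in other columns.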

score 1 · Accepted Answer

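You can remove duplicates from a Vertica table by creating a temporary table and generating a pseudo row_id. It takes a few steps, and the approach is especially useful when removing duplicates from very large and wide tables. The example below assumes that the k1 and k2 rows each have more than one duplicate. For more details, see here.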

-- Step 1: Find the duplicates
select key, count(1) from large-table-1
where [where-conditions]
group by 1
having count(1) > 1
order by count(1) desc  ;

-- Step 2:  Dump the duplicates into temp table
create table test.large-table-1-dups
like large-table-1;

alter table test.large-table-1-dups     -- add row_num column (pseudo row_id)
add column row_num int;

insert into test.large-table-1-dups
select *, ROW_NUMBER() OVER(PARTITION BY key)
from large-table-1
where key in ('k1', 'k2');     -- where, say, k1 has n and k2 has m exact dups

-- Step 3: Remove duplicates from the temp table
delete from test.large-table-1-dups
where row_num > 1;

select * from test.large-table-1-dups;
--  Sanity test.  Should have 1 row each of k1 & k2 rows above

-- Step 4: Delete all duplicates from main table...
delete from large-table-1
where key in ('k1', 'k2');

-- Step 5: Insert data back into main table from temp dedupe data
alter table test.large-table-1-dups
drop column row_num;

insert into large-table-1
select * from test.large-table-1-dups;
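After step 5, a quick check and cleanup might look like the sketch below. It reuses the answer's placeholder names (large-table-1, key, test.large-table-1-dups) and assumes the session does not autocommit, so any pending changes need an explicit COMMIT.

```sql
-- Assumed follow-up: confirm k1 and k2 now appear once each
select key, count(1) from large-table-1
where key in ('k1', 'k2')
group by 1;

-- Make the pending changes permanent
commit;

-- Drop the scratch table once the main table looks right
drop table test.large-table-1-dups;
```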

score -2 · Accepted Answer

See this answer from the PostgreSQL wiki, which also works in Vertica.

DELETE
FROM
    tablename
WHERE
    id IN(
        SELECT
            id
        FROM
            (
                SELECT
                    id,
                    ROW_NUMBER() OVER(
                        partition BY column1,
                        column2,
                        column3
                    ORDER BY
                        id
                    ) AS rnum
                FROM
                    tablename
            ) t
        WHERE
            t.rnum > 1
    );

This deletes all duplicate entries except the one with the lowest id.
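To see what the DELETE above would remove before running it, the inner query can be run on its own. A sketch using the same placeholder names (tablename, id, column1 to column3):

```sql
-- Preview: the rows listed here are the ones the DELETE would remove.
SELECT id, column1, column2, column3
FROM (
    SELECT id, column1, column2, column3,
           ROW_NUMBER() OVER (
               PARTITION BY column1, column2, column3
               ORDER BY id
           ) AS rnum
    FROM tablename
) t
WHERE t.rnum > 1
ORDER BY column1, column2, column3, id;
```

Note that the approach relies on id being unique. If two rows are identical in every column including id, the IN (...) filter matches both of them and neither survives.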
