amazon-web-services - Copy data from Amazon S3 to Redshift and avoid duplicating rows

Question

I am copying data from Amazon S3 to Redshift. During this process, I need to avoid loading the same files again. My Redshift table has no unique constraints. Is there a way to implement this using the copy command?

http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html

I tried adding a unique constraint and setting a column as the primary key, but neither worked. Redshift does not seem to support unique/primary key constraints.

score 17 · Accepted Answer

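As user1045047 mentioned, Amazon Redshift does not enforce unique constraints (it accepts them as informational only), so I had been looking for a way to delete duplicate records from a table with a delete statement. I finally found a reasonable way.

Amazon Redshift supports creating an IDENTITY column that stores an auto-generated unique number. http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html

The following SQL is for PostgreSQL and deletes duplicate records using OID, which is a unique column. You can use the same SQL on Redshift by replacing OID with the IDENTITY column.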

DELETE FROM duplicated_table WHERE OID > (
  SELECT MIN(OID) FROM duplicated_table d2
    WHERE duplicated_table.column1 = d2.column1
    AND duplicated_table.column2 = d2.column2
);

Here is an example that I tested on my Amazon Redshift cluster.

create table auto_id_table (auto_id int IDENTITY, name varchar, age int);

insert into auto_id_table (name, age) values('John', 18);
insert into auto_id_table (name, age) values('John', 18);
insert into auto_id_table (name, age) values('John', 18);
insert into auto_id_table (name, age) values('John', 18);
insert into auto_id_table (name, age) values('John', 18);
insert into auto_id_table (name, age) values('Bob', 20);
insert into auto_id_table (name, age) values('Bob', 20);  
insert into auto_id_table (name, age) values('Matt', 24); 

select * from auto_id_table order by auto_id; 
 auto_id | name | age 
---------+------+-----
       1 | John |  18
       2 | John |  18
       3 | John |  18
       4 | John |  18
       5 | John |  18
       6 | Bob  |  20
       7 | Bob  |  20
       8 | Matt |  24    
(8 rows) 

delete from auto_id_table where auto_id > (
  select min(auto_id) from auto_id_table d
    where auto_id_table.name = d.name
    and auto_id_table.age = d.age
);

select * from auto_id_table order by auto_id;
 auto_id | name | age 
---------+------+-----
       1 | John |  18
       6 | Bob  |  20
       8 | Matt |  24
(3 rows)
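
If you want to try the delete query without a cluster, here is a minimal sketch that runs the same statement against an in-memory SQLite database from Python. SQLite is only a stand-in here: `INTEGER PRIMARY KEY AUTOINCREMENT` plays the role of the Redshift IDENTITY column, and the helper names `delete_duplicates` and `demo_delete` are made up for this illustration.

```python
import sqlite3


def delete_duplicates(conn):
    """Keep only the row with the smallest auto_id for each (name, age) pair."""
    with conn:  # commit on success, roll back on error
        conn.execute(
            """
            DELETE FROM auto_id_table
            WHERE auto_id > (
                SELECT MIN(d.auto_id) FROM auto_id_table AS d
                WHERE auto_id_table.name = d.name
                  AND auto_id_table.age = d.age
            )
            """
        )


def demo_delete():
    conn = sqlite3.connect(":memory:")
    # AUTOINCREMENT stands in for the Redshift IDENTITY column (assumption).
    conn.execute(
        "CREATE TABLE auto_id_table ("
        "auto_id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, age INTEGER)"
    )
    rows = [("John", 18)] * 5 + [("Bob", 20)] * 2 + [("Matt", 24)]
    conn.executemany("INSERT INTO auto_id_table (name, age) VALUES (?, ?)", rows)
    delete_duplicates(conn)
    return conn.execute(
        "SELECT auto_id, name, age FROM auto_id_table ORDER BY auto_id"
    ).fetchall()


if __name__ == "__main__":
    print(demo_delete())
```

The outer table has to be referenced by name inside the subquery. Without that qualification the comparison is resolved against the inner alias and matches every row.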

It also works with the COPY command, like this.

auto_id_table.csv
```
John,18
Bob,20
Matt,24
```

COPY SQL

copy auto_id_table (name, age) from '[s3-path]/auto_id_table.csv' CREDENTIALS 'aws_access_key_id=[your-aws-key-id];aws_secret_access_key=[your-aws-secret-key]' delimiter ',';

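The advantage of this approach is that you do not need to run any DDL statements. However, it does not work for existing tables that lack an IDENTITY column, because an IDENTITY column cannot be added to an existing table. The only way to remove duplicate records from such a table is to migrate all of its records like this (same as user1045047's answer).

insert into temp_table (select distinct * from original_table);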
drop table original_table;
alter table temp_table rename to original_table;
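
The migration steps above can be tried locally with a rough sketch like the following, again using Python's sqlite3 as a stand-in for Redshift. The two-column layout (`name`, `age`) and the helper names are assumptions for the demo. On a real cluster, `temp_table` would need the same column definitions as `original_table`.

```python
import sqlite3


def rebuild_without_duplicates(conn):
    """Replace original_table with a copy that holds each distinct row once."""
    with conn:  # commit on success, roll back on error
        # temp_table must have the same columns as original_table (assumed here).
        conn.execute("CREATE TABLE temp_table (name TEXT, age INTEGER)")
        conn.execute("INSERT INTO temp_table SELECT DISTINCT * FROM original_table")
        conn.execute("DROP TABLE original_table")
        conn.execute("ALTER TABLE temp_table RENAME TO original_table")


def demo_rebuild():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE original_table (name TEXT, age INTEGER)")
    conn.executemany(
        "INSERT INTO original_table VALUES (?, ?)",
        [("John", 18), ("John", 18), ("John", 18), ("Bob", 20)],
    )
    rebuild_without_duplicates(conn)
    return conn.execute(
        "SELECT name, age FROM original_table ORDER BY name"
    ).fetchall()


if __name__ == "__main__":
    print(demo_rebuild())
```

Note that `select distinct *` only collapses rows that match in every column, so rows that differ in any one field are all kept.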

score 7 · Accepted Answer

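My solution is to run a 'delete' command before the 'copy' on the table. In my use case, each time I need to copy the records of a daily snapshot into the Redshift table, I first run the following 'delete' command to make sure any records already loaded for that day are removed, and then run the 'copy' command.

delete from t_data where snapshot_day = 'xxxx-xx-xx';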
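
As a small illustration of why delete-then-load makes reruns safe, here is a sketch using Python's sqlite3 in place of Redshift. The `value` column, the sample dates, and the helper names are hypothetical, and a plain INSERT stands in for the COPY of that day's file.

```python
import sqlite3


def reload_snapshot(conn, snapshot_day, values):
    """Delete any rows already loaded for snapshot_day, then load them again."""
    with conn:  # delete and load commit together, or roll back together
        conn.execute("DELETE FROM t_data WHERE snapshot_day = ?", (snapshot_day,))
        # Stand-in for the Redshift COPY of that day's file (assumption).
        conn.executemany(
            "INSERT INTO t_data (snapshot_day, value) VALUES (?, ?)",
            [(snapshot_day, v) for v in values],
        )


def demo_reload():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t_data (snapshot_day TEXT, value INTEGER)")
    reload_snapshot(conn, "2020-01-01", [1, 2, 3])
    reload_snapshot(conn, "2020-01-01", [1, 2, 3])  # same file loaded twice
    reload_snapshot(conn, "2020-01-02", [4])
    return conn.execute(
        "SELECT snapshot_day, COUNT(*) FROM t_data "
        "GROUP BY snapshot_day ORDER BY snapshot_day"
    ).fetchall()


if __name__ == "__main__":
    print(demo_reload())
```

On Redshift it is probably worth running the delete and the copy inside one transaction as well, so that a failed copy does not leave the day empty.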
