postgresql - Migrating from Influx to Postgres, need tips

Question

I was using Influx to store our time series data. It was cool when it worked, but after about a month it stopped working and I couldn't figure out why. (Similar to this issue: https://github.com/influxdb/influxdb/issues/1386 )

Influx may be great someday, but for now I need something more stable. I'm thinking about Postgres. Our data comes from many sensors, and each sensor has a sensor ID. So I'm thinking of structuring the data like this:

(pk), sensor ID (string), time (timestamp), value (float)

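Influx is built for time series data, so it probably has some optimizations built in. Do I need to do those optimizations myself to make Postgres efficient? Specifically, I have the following questions: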

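1. Influx has this notion of a "series", and creating a new series is cheap, so I had a separate series for each sensor. Should I create a separate Postgres table for each sensor?
2. How should I set up indexes to make queries fast? A typical query is: select all data for sensor123 for the last 3 days.
3. Should I use a timestamp or an integer for the time column?
4. How do I set a retention policy? For example, automatically delete data that is more than one week old.
5. Does Postgres scale horizontally? Can I set up an ec2 cluster for data replication and load balancing?
6. Can I downsample in Postgres? I have read in some articles that I can use date_trunc, but it seems that I cannot date_trunc to a specific interval such as 25 seconds.
7. Are there any other caveats I have missed?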

Thanks in advance!

Update: Storing the time column as a big integer seems to be faster than storing it as a timestamp. Am I doing something wrong?

Storing it as a timestamp:

postgres=# explain analyze select * from test where sensorid='sensor_0';

Bitmap Heap Scan on test  (cost=3180.54..42349.98 rows=75352 width=25) (actual time=10.864..19.604 rows=51840 loops=1)
   Recheck Cond: ((sensorid)::text = 'sensor_0'::text)
   Heap Blocks: exact=382
   ->  Bitmap Index Scan on sensorindex  (cost=0.00..3161.70 rows=75352 width=0) (actual time=10.794..10.794 rows=51840 loops=1)
         Index Cond: ((sensorid)::text = 'sensor_0'::text)
 Planning time: 0.118 ms
 Execution time: 22.984 ms

postgres=# explain analyze select * from test where sensorid='sensor_0' and addedtime > to_timestamp(1430939804);

 Bitmap Heap Scan on test  (cost=2258.04..43170.41 rows=50486 width=25) (actual time=22.375..27.412 rows=34833 loops=1)
   Recheck Cond: (((sensorid)::text = 'sensor_0'::text) AND (addedtime > '2015-05-06 15:16:44-04'::timestamp with time zone))
   Heap Blocks: exact=257
   ->  Bitmap Index Scan on sensorindex  (cost=0.00..2245.42 rows=50486 width=0) (actual time=22.313..22.313 rows=34833 loops=1)
         Index Cond: (((sensorid)::text = 'sensor_0'::text) AND (addedtime > '2015-05-06 15:16:44-04'::timestamp with time zone))
 Planning time: 0.362 ms
 Execution time: 29.290 ms

Storing it as a big integer:

postgres=# explain analyze select * from test where sensorid='sensor_0';


 Bitmap Heap Scan on test  (cost=3620.92..42810.47 rows=85724 width=25) (actual time=12.450..19.615 rows=51840 loops=1)
   Recheck Cond: ((sensorid)::text = 'sensor_0'::text)
   Heap Blocks: exact=382
   ->  Bitmap Index Scan on sensorindex  (cost=0.00..3599.49 rows=85724 width=0) (actual time=12.359..12.359 rows=51840 loops=1)
         Index Cond: ((sensorid)::text = 'sensor_0'::text)
 Planning time: 0.130 ms
 Execution time: 22.331 ms

postgres=# explain analyze select * from test where sensorid='sensor_0' and addedtime > 1430939804472;


 Bitmap Heap Scan on test  (cost=2346.57..43260.12 rows=52489 width=25) (actual time=10.113..14.780 rows=31839 loops=1)
   Recheck Cond: (((sensorid)::text = 'sensor_0'::text) AND (addedtime > 1430939804472::bigint))
   Heap Blocks: exact=235
   ->  Bitmap Index Scan on sensorindex  (cost=0.00..2333.45 rows=52489 width=0) (actual time=10.059..10.059 rows=31839 loops=1)
         Index Cond: (((sensorid)::text = 'sensor_0'::text) AND (addedtime > 1430939804472::bigint))
 Planning time: 0.154 ms
 Execution time: 16.589 ms

score 2 · Accepted Answer

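Don't create a table for each sensor. Instead, add a field to your table that identifies which series each row belongs to. You could also have another table that describes additional attributes of each series. If a data point could belong to multiple series, you would need a different structure altogether.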
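
As a rough illustration of that layout, here is a minimal sketch. The table and column names (sensors, readings, sensor_id, recorded_at, value, description) are assumptions made up for this example, not something taken from the question, and the column types are just one reasonable choice.

```sql
-- Hypothetical names throughout; adjust to your own schema.

-- One row per series (sensor); extra per-series attributes live here.
CREATE TABLE sensors (
    sensor_id   text PRIMARY KEY,
    description text
);

-- One shared table for all data points; sensor_id says which series a row is in.
CREATE TABLE readings (
    id          bigserial PRIMARY KEY,
    sensor_id   text NOT NULL REFERENCES sensors (sensor_id),
    recorded_at timestamptz NOT NULL,   -- timestamptz = TIMESTAMP WITH TIME ZONE
    value       double precision NOT NULL
);
```

The foreign key is optional. It guards against typos in sensor IDs, but it also means a sensor row has to exist before its first reading is inserted.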

For the query you described in q2, an index on a recorded_at column should work (time is a SQL reserved keyword, so it is best to avoid it as a name).
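
A sketch of what that could look like, assuming a hypothetical table readings(sensor_id text, recorded_at timestamptz, value double precision). Note that this uses a composite index on (sensor_id, recorded_at) rather than an index on recorded_at alone. That is a variation on the suggestion above, chosen because the typical query filters on both columns. The sensorindex in the question's EXPLAIN output appears to cover both columns in the same way.

```sql
-- Assumes a hypothetical table:
--   readings(sensor_id text, recorded_at timestamptz, value double precision)

-- Equality column first, range column second, so one index scan can
-- serve "this sensor, within this time window".
CREATE INDEX readings_sensor_time_idx
    ON readings (sensor_id, recorded_at);

-- The typical query from q2: all data for sensor123 for the last 3 days.
SELECT recorded_at, value
FROM readings
WHERE sensor_id = 'sensor123'
  AND recorded_at > now() - interval '3 days'
ORDER BY recorded_at;
```

Running the SELECT under EXPLAIN ANALYZE, as in the question, is the way to check whether the planner actually uses the index on your data.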

You should use TIMESTAMP WITH TIME ZONE as your time data type.

Retention is up to you.
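
In other words, core Postgres has no built-in retention setting, so old rows have to be removed by a statement you run yourself. A minimal sketch, assuming a hypothetical readings table with a recorded_at timestamptz column and some external scheduler (cron, for example) that runs it periodically:

```sql
-- Assumes a hypothetical table readings(..., recorded_at timestamptz, ...).
-- Run periodically from an external scheduler; nothing here schedules itself.

-- Delete everything more than one week old.
DELETE FROM readings
WHERE recorded_at < now() - interval '1 week';
```

On large tables a plain DELETE like this can be slow and leaves dead rows for vacuum to clean up, so it is worth measuring before relying on it.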

Postgres has various options for sharding and replication. That is a big topic.

I'm not sure I understand your objective for #6, but I'm sure you can figure something out.
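
If the goal of #6 is to average readings into fixed 25-second buckets, one common approach (offered here as a sketch, not as the only way) is to skip date_trunc and round the epoch value down to a multiple of the bucket size. The table and column names below are the same hypothetical ones as before: readings(sensor_id text, recorded_at timestamptz, value double precision).

```sql
-- Assumes a hypothetical table:
--   readings(sensor_id text, recorded_at timestamptz, value double precision)

-- Downsample sensor123 to 25-second buckets.
-- extract(epoch ...) gives seconds since 1970-01-01 UTC; dividing by 25,
-- flooring, and multiplying back snaps each row to the start of its bucket.
SELECT
    to_timestamp(
        (floor(extract(epoch FROM recorded_at) / 25) * 25)::double precision
    )          AS bucket_start,
    avg(value) AS avg_value,
    count(*)   AS samples
FROM readings
WHERE sensor_id = 'sensor123'
  AND recorded_at > now() - interval '3 days'
GROUP BY bucket_start
ORDER BY bucket_start;
```

Buckets are aligned to the Unix epoch, so they start at :00, :25, :50 past a minute only by coincidence. Empty buckets produce no row at all.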
