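I highly doubt I'm doing this in the most efficient way, which is why I tagged plpgsql here. I need to run this on 2 billion rows for 1,000 measurement systems.
You have measurement systems that often report the previous value when they lose connectivity, and they lose connectivity frequently, sometimes for long periods. You need to aggregate, but when you do, you need to look at how long a value has been repeating and build various filters on that information. Say you are measuring mpg on a car, but it stays stuck at 20 mpg for an hour rather than moving to 20.1 and so on. You'll want to evaluate the accuracy while it is stuck. You could also add alternative rules that look for when the car is on the highway; with window functions you can generate the "state" of the car and have something to group on. Without further ado: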
--here's my data: different systems, the time of measurement, and the actual measurement
--the raw data also carries a flag for whether each row is a repeat (hence the included window function)
select * into temporary table cumulative_repeat_calculator_data
FROM
(
    select
        system_measured, time_of_measurement, measurement,
        case when
            measurement = lag(measurement,1) over (partition by system_measured order by time_of_measurement asc)
        then 1 else 0 end as repeat
    FROM
    (
        SELECT 5 as measurement, 1 as time_of_measurement, 1 as system_measured
        UNION
        SELECT 150 as measurement, 2 as time_of_measurement, 1 as system_measured
        UNION
        SELECT 5 as measurement, 3 as time_of_measurement, 1 as system_measured
        UNION
        SELECT 5 as measurement, 4 as time_of_measurement, 1 as system_measured
        UNION
        SELECT 5 as measurement, 1 as time_of_measurement, 2 as system_measured
        UNION
        SELECT 5 as measurement, 2 as time_of_measurement, 2 as system_measured
        UNION
        SELECT 5 as measurement, 3 as time_of_measurement, 2 as system_measured
        UNION
        SELECT 5 as measurement, 4 as time_of_measurement, 2 as system_measured
        UNION
        SELECT 150 as measurement, 5 as time_of_measurement, 2 as system_measured
        UNION
        SELECT 5 as measurement, 6 as time_of_measurement, 2 as system_measured
        UNION
        SELECT 5 as measurement, 7 as time_of_measurement, 2 as system_measured
        UNION
        SELECT 5 as measurement, 8 as time_of_measurement, 2 as system_measured
    ) as data
) as data;
--unfortunately you can't nest window functions inside window functions, so I had to break it down into subqueries
--what we need is something to partition on, the 'state' of the system if you will, so I took a running total of the non-repeats
--this creates a column that stays the same while your data is repeating, i.e. something you can partition/group on
select * into temporary table cumulative_repeat_calculator_step_1
FROM
(
    select
        *,
        sum(case when repeat = 0 then 1 else 0 end) over (partition by system_measured order by time_of_measurement asc) as cumlative_sum_of_nonrepeats_by_system
    from cumulative_repeat_calculator_data
    order by system_measured, time_of_measurement
) as data;
--finally, the query. I didn't bother showing my desired output, because this (finally) got it
--I wanted a sequential count of repeats that restarts when it stops repeating, and starts with the first repeat
--what you can do now is take the average measurement under some condition based on how long it was repeating, for example
select *,
    case when repeat = 0 then 0
    else
        row_number() over (partition by cumlative_sum_of_nonrepeats_by_system, system_measured order by time_of_measurement) - 1
    end as ordered_repeat
from cumulative_repeat_calculator_step_1
order by system_measured, time_of_measurement;
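For reference, the three steps can probably be collapsed into one statement with no intermediate temp tables. This is only a sketch of the usual gaps-and-islands trick (the difference of two row_number() calls), not something benchmarked on the full table; `island` and `runs` are made-up names. It still needs one subquery, because the island id has to exist before it can be partitioned on. It assumes (system_measured, time_of_measurement) is unique, and unlike the lag() comparison it treats consecutive NULL measurements as repeats.

```sql
-- Sketch only: reads the sample temp table created above.
-- "island" is a hypothetical name for the per-run group id.
SELECT
    system_measured,
    time_of_measurement,
    measurement,
    -- position inside the run, starting at 0 for the first (non-repeat) row
    row_number() OVER (
        PARTITION BY system_measured, measurement, island
        ORDER BY time_of_measurement
    ) - 1 AS ordered_repeat
FROM (
    SELECT
        system_measured,
        time_of_measurement,
        measurement,
        -- the difference stays constant while the same value repeats
        -- consecutively, so (measurement, island) identifies one run
        row_number() OVER (PARTITION BY system_measured
                           ORDER BY time_of_measurement)
      - row_number() OVER (PARTITION BY system_measured, measurement
                           ORDER BY time_of_measurement) AS island
    FROM cumulative_repeat_calculator_data
) AS runs
ORDER BY system_measured, time_of_measurement;
```

On the sample data this gives ordered_repeat 0, 0, 0, 1 for system 1 and 0, 1, 2, 3, 0, 0, 1, 2 for system 2, the same as the query above. Whether it is actually cheaper on 2 billion rows depends on the sorts the planner chooses, so EXPLAIN ANALYZE on a realistic slice would be the thing to check.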
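So, what would I change to run this on a huge table, or what alternative tools would I use? I'm thinking of plpgsql because I suspect this needs to happen in the database or during the data insertion process, although I generally work with the data after it's loaded. Is there a way to get this in one sweep without resorting to subqueries?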
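To make the "one sweep" idea concrete, here is a rough PL/pgSQL sketch that walks the rows once in (system, time) order and carries the run length in a variable. The function name and output column names are invented for illustration, and the integer types are an assumption taken from the sample data, not from the real table.

```sql
-- Hypothetical function; types assume the integer sample data above.
CREATE OR REPLACE FUNCTION ordered_repeat_sweep()
RETURNS TABLE (
    out_system      integer,
    out_time        integer,
    out_measurement integer,
    ordered_repeat  integer
)
LANGUAGE plpgsql
AS $$
DECLARE
    r           record;
    prev_system integer;        -- NULL until the first row has been seen
    prev_value  integer;
    run_length  integer := 0;
BEGIN
    FOR r IN
        SELECT d.system_measured, d.time_of_measurement, d.measurement
        FROM cumulative_repeat_calculator_data AS d
        ORDER BY d.system_measured, d.time_of_measurement
    LOOP
        -- A NULL comparison is treated as false, so the first row of each
        -- system and any NULL measurement restart the count at 0.
        IF r.system_measured = prev_system AND r.measurement = prev_value THEN
            run_length := run_length + 1;
        ELSE
            run_length := 0;
        END IF;

        out_system      := r.system_measured;
        out_time        := r.time_of_measurement;
        out_measurement := r.measurement;
        ordered_repeat  := run_length;
        RETURN NEXT;

        prev_system := r.system_measured;
        prev_value  := r.measurement;
    END LOOP;
END;
$$;

SELECT * FROM ordered_repeat_sweep();
```

One caveat: PL/pgSQL collects everything emitted by RETURN NEXT before the function returns, so on billions of rows it would likely make more sense to call something like this per system, or to write the rows to a table inside the loop. The ORDER BY also still has to sort or use an index, so this is not automatically faster than the window-function version.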
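I have tested one alternative method, but it still relies on a subquery, and I think this is faster. For that method you create a "starts and stops" table with start_timestamp, end_timestamp, and system. You then join it to the larger table, and if a row's timestamp falls between the two, you classify it as being in that state, which is essentially an alternative to cumlative_sum_of_nonrepeats_by_system. But when you do this, you join on 1=1 for thousands of devices and thousands or millions of "events". Do you think that's a better way to go?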
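A sketch of that starts-and-stops approach, built from the step_1 table above. The names `starts_and_stops`, `state_id` and `repeat_count` are made up for illustration. The main point is that the join can carry system_measured as an equality condition, so it does not have to be a 1=1 join with only the range predicate; whether the planner then does better is something to check with EXPLAIN on the real data.

```sql
-- Hypothetical "starts and stops" table: one row per run of identical values.
CREATE TEMPORARY TABLE starts_and_stops AS
SELECT
    system_measured,
    cumlative_sum_of_nonrepeats_by_system AS state_id,
    min(time_of_measurement)              AS start_timestamp,
    max(time_of_measurement)              AS end_timestamp,
    count(*) - 1                          AS repeat_count  -- repeats after the first row
FROM cumulative_repeat_calculator_step_1
GROUP BY system_measured, cumlative_sum_of_nonrepeats_by_system;

-- Lets the range lookup be narrowed to a single system first.
CREATE INDEX ON starts_and_stops (system_measured, start_timestamp, end_timestamp);

-- Classify each measurement by the run it falls into.
SELECT
    d.system_measured,
    d.time_of_measurement,
    d.measurement,
    s.state_id,
    s.repeat_count
FROM cumulative_repeat_calculator_data AS d
JOIN starts_and_stops AS s
  ON  s.system_measured = d.system_measured
  AND d.time_of_measurement BETWEEN s.start_timestamp AND s.end_timestamp
ORDER BY d.system_measured, d.time_of_measurement;
```

Because runs within one system never overlap in time, each measurement matches exactly one row of the run table. With repeat_count attached, a filter such as "ignore runs that repeated more than N times" becomes a plain WHERE clause on the joined result.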