r - Efficient use of functions on long data.frames in R

Question

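I have a long data frame containing meteorological data from a mast. It contains observations (data$value) of different parameters (wind speed, wind direction, air temperature, etc., in data$param) taken at the same time at different heights (data$z).

I am trying to slice this data efficiently by $time and then apply functions to all of the data collected. Usually a function is applied to a single $param at a time (i.e. I apply different functions to wind speed than I do to air temperature).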

Current approach

My current method is to use a data.frame and ddply.

If I want to get all of the wind speed data, I run this:

# find good data ----
df <- data[((data$param == "wind speed") &
                  !is.na(data$value)),]

I then run my function on df using ddply():

library(plyr)  # provides ddply() and .()

# compute the summaries once per time step
df.tav <- ddply(df,
                .(time),
                function(x) {
                  data.frame(V1 = sum(x$value) + sum(x$z),
                             V2 = sum(x$value) / sum(x$z))
                })

Typically V1 and V2 are calls to other functions; these are just examples. I do need to run multiple functions on the same data, though.
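As a rough illustration of that pattern, with V1 and V2 standing in for calls to other functions, the sketch below uses only base R on a tiny made-up data frame. The helper names `combined_total()` and `speed_height_ratio()` and the sample values are hypothetical and not part of the original code.

```r
# Tiny made-up sample in the same long layout (time, z, param, value).
toy <- data.frame(
  time  = c(1, 1, 2, 2),
  z     = c(100, 120, 100, 120),
  param = "wind speed",
  value = c(4.0, 6.0, 2.0, 3.0)
)

# Hypothetical stand-ins for the real per-time-step functions.
combined_total     <- function(x) sum(x$value) + sum(x$z)
speed_height_ratio <- function(x) sum(x$value) / sum(x$z)

# Split by time, apply both functions to each slice, and bind the rows.
slices <- split(toy, toy$time)
toy_summary <- do.call(rbind, lapply(slices, function(x) {
  data.frame(time = x$time[1],
             V1 = combined_total(x),
             V2 = speed_height_ratio(x))
}))
```

This does the same split-apply-combine work as the ddply() call and is not meant as a speed-up; it only shows where the real functions would plug in.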

Question

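My current approach is very slow. I have not benchmarked it, but it is slow enough that I can go get a coffee and come back before a year's worth of data has been processed.

I have on the order of hundreds of towers to process, each with a year of data and 10 to 12 heights, so I am looking for something faster.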

Data sample

data <-  structure(list(time = structure(c(1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262305200, 1262305200, 1262305200, 1262305200, 1262305200, 1262305200, 
1262305200), class = c("POSIXct", "POSIXt"), tzone = ""), z = c(0, 
0, 0, 100, 100, 100, 120, 120, 120, 140, 140, 140, 160, 160, 
160, 180, 180, 180, 200, 200, 200, 40, 40, 40, 50, 50, 50, 60, 
60, 60, 80, 80, 80, 0, 0, 0, 100, 100, 100, 120), param = c("temperature", 
"humidity", "barometric pressure", "wind direction", "turbulence", 
"wind speed", "wind direction", "turbulence", "wind speed", "wind direction", 
"turbulence", "wind speed", "wind direction", "turbulence", "wind speed", 
"wind direction", "turbulence", "wind speed", "wind direction", 
"turbulence", "wind speed", "wind direction", "turbulence", "wind speed", 
"wind direction", "turbulence", "wind speed", "wind direction", 
"turbulence", "wind speed", "wind direction", "turbulence", "wind speed", 
"temperature", "barometric pressure", "humidity", "wind direction", 
"wind speed", "turbulence", "wind direction"), value = c(-2.5, 
41, 816.9, 248.4, 0.11, 4.63, 249.8, 0.28, 4.37, 255.5, 0.32, 
4.35, 252.4, 0.77, 5.08, 248.4, 0.65, 3.88, 313, 0.94, 6.35, 
250.9, 0.1, 4.75, 253.3, 0.11, 4.68, 255.8, 0.1, 4.78, 254.9, 
0.11, 4.7, -3.3, 816.9, 42, 253.2, 2.18, 0.27, 229.5)), .Names = c("time", 
"z", "param", "value"), row.names = c(NA, 40L), class = "data.frame")

score 14 · Accepted Answer

Use data.table:

library(data.table)
dt = data.table(data)

setkey(dt, param)  # sort by param to look it up fast

dt[J('wind speed')][!is.na(value),
                    list(sum(value) + sum(z), sum(value)/sum(z)),
                    by = time]
#                  time      V1         V2
#1: 2009-12-31 18:10:00 1177.57 0.04209735
#2: 2009-12-31 18:20:00  102.18 0.02180000

If you want to apply different functions to different params, here is a more uniform approach.

# make dt smaller because I'm lazy
dt = dt[param %in% c('wind direction', 'wind speed')]

# now let's start - create another data.table
# that will have param and corresponding function
fns = data.table(p = c('wind direction', 'wind speed'),
                 fn = c(quote(sum(value) + sum(z)), quote(sum(value) / sum(z))),
                 key = 'p')
fns
#                p     fn
#1: wind direction <call>    # the fn column contains functions
#2:     wind speed <call>    # i.e. this is getting fancy!

# now we can evaluate different functions for different params,
# sliced by param and time
dt[!is.na(value), {param; eval(fns[J(param)]$fn[[1]], .SD)},
   by = list(param, time)]
#            param                time           V1
#1: wind direction 2009-12-31 18:10:00 3.712400e+03
#2: wind direction 2009-12-31 18:20:00 7.027000e+02
#3:     wind speed 2009-12-31 18:10:00 4.209735e-02
#4:     wind speed 2009-12-31 18:20:00 2.180000e-02

PS: I think the fact that I have to use param in some way before eval for eval to work is a bug.

UPDATE: As of version 1.8.11 this bug has been fixed, and the following works:

dt[!is.na(value), eval(fns[J(param)]$fn[[1]], .SD), by = list(param, time)]
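The core trick in the code above is to store unevaluated expressions and evaluate them with a group's columns in scope. Here is a minimal base R sketch of that idea. It uses a made-up data frame and a plain named list in place of a keyed data.table; both are assumptions for illustration only.

```r
# Made-up slice of data for one (param, time) group.
slice <- data.frame(z = c(100, 120), value = c(4.0, 6.0))

# Unevaluated expressions, looked up by parameter name.
exprs <- list(
  "wind direction" = quote(sum(value) + sum(z)),
  "wind speed"     = quote(sum(value) / sum(z))
)

# eval() resolves `value` and `z` against the columns of `slice`.
ws <- eval(exprs[["wind speed"]], slice)
wd <- eval(exprs[["wind direction"]], slice)
```

In the data.table version, `.SD` plays the role of `slice` and the keyed join `fns[J(param)]` plays the role of the list lookup.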

score 9 · Accepted Answer

Use dplyr. It is still in development, but it is much, much faster than plyr:

# devtools::install_github("hadley/dplyr")
library(dplyr)

windspeed <- subset(data, param == "wind speed")
daily <- group_by(windspeed, time)

summarise(daily, V1 = sum(value) + sum(z), V2 = sum(value) / sum(z))

Another advantage of dplyr is that you can use a data table as a backend without knowing anything about data.table's special syntax:

library(data.table)
daily_dt <- group_by(data.table(windspeed), time)
summarise(daily_dt, V1 = sum(value) + sum(z), V2 = sum(value) / sum(z))

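(dplyr with a data frame is 20 to 100 times faster than plyr, and dplyr with a data.table is about another 10 times faster.) dplyr is nowhere near as concise as data.table, but it has a function for each major data analysis task, which I find makes the code easier to understand. You should almost be able to read a sequence of dplyr operations aloud to someone else and have them understand what is going on.

If you want to do different summaries per variable, I recommend changing your data structure to be "tidy":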

library(reshape2)
data_tidy <- dcast(data, ... ~ param)

daily_tidy <- group_by(data_tidy, time)
summarise(daily_tidy, 
  mean.pressure = mean(`barometric pressure`, na.rm = TRUE),
  sd.turbulence = sd(turbulence, na.rm = TRUE)
)
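To show what the "tidy" reshaping step produces, here is a small sketch on made-up values. It swaps in base R's `reshape()` for `dcast()`, so the column names differ: `reshape()` prefixes the new columns with `value.`, whereas `dcast()` names them after the parameter alone.

```r
# Made-up long-format sample: one row per (time, z, param) observation.
long <- data.frame(
  time  = c(1, 1, 2, 2),
  z     = c(0, 0, 0, 0),
  param = c("temperature", "humidity", "temperature", "humidity"),
  value = c(-2.5, 41, -3.3, 42),
  stringsAsFactors = FALSE
)

# Wide ("tidy") layout: one row per (time, z), one column per parameter.
wide <- reshape(long,
                idvar     = c("time", "z"),
                timevar   = "param",
                direction = "wide")
```

With one column per parameter, each summary can refer to its variable by name instead of first filtering on param.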
