r - Finding the maximum value from a combination of two tables (for loop is too slow)

Question

I have a data table, "the.data". The first column identifies the instrument, and the remaining columns hold the different measurements.

instrument <- c(1,2,3,4,5,1,2,3,4,5)
hour <- c(1,1,1,1,1,2,2,2,2,2)
da <- c(12,14,11,14,10,19,15,16,13,11)
db <- c(21,23,22,29,28,26,24,27,26,22)
the.data <- data.frame(instrument,hour,da,db)

I have also defined groups of instruments. For example, group 1 (g1) refers to instruments 1 and 2.

g1 <- c(1,2)
g2 <- c(4,3,1)
g3 <- c(1,5,2)
g4 <- c(2,4)
g5 <- c(5,3,1,2,6)
groups <- c("g1","g2","g3","g4","g5")

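For each group and each data type, I need to find the hour at which the group's sum is largest, together with that sum.

g1, hour 1: sum(da) = 12 + 14 = 26
g1, hour 2: sum(da) = 19 + 15 = 34

So for g1 and da, the answer is hour 2 with a value of 34.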
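The worked example above can be reproduced in a few lines of base R. This is only a minimal sketch of the definition of the problem, not a fast solution; the names g1.rows, da.by.hour, best.hour and best.sum are illustrative and do not come from the question.

```r
the.data <- data.frame(
  instrument = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  hour       = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  da         = c(12, 14, 11, 14, 10, 19, 15, 16, 13, 11),
  db         = c(21, 23, 22, 29, 28, 26, 24, 27, 26, 22)
)
g1 <- c(1, 2)

# Keep only the rows measured by instruments in g1
g1.rows <- the.data[the.data$instrument %in% g1, ]

# Sum da within each hour; the result is named by hour ("1", "2")
da.by.hour <- tapply(g1.rows$da, g1.rows$hour, sum)

# The hour with the largest sum, and the sum itself
best.hour <- as.numeric(names(which.max(da.by.hour)))
best.sum  <- max(da.by.hour)
```

Here da.by.hour holds 26 for hour 1 and 34 for hour 2, so best.hour is 2 and best.sum is 34, matching the hand calculation.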

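I did this with a for loop inside a for loop, but it takes far too long (I interrupted it after a few hours). The problem is that the.data has about 100,000 rows and there are about 5,000 groups, each containing 2 to 50 instruments.

What is a good way to do this?

Many thanks to all contributors to Stack Overflow.

Update: the example now has only 5 groups.

/Chris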

score 4 · Accepted Answer

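The loop over group should be left as it is, or at most replaced with something like lapply(). The loop over hour, however, can be replaced completely by reshaping the data into an instrument x hour matrix and then doing vectorized algebra. For example: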

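library(reshape2)

groups = list(g1, g3)

# One row per instrument, one column per hour, with da as the cell value
the.data.a = dcast(the.data[,1:3], instrument ~ hour, value.var = "da")

# For each group: sum every hour column over the group's rows, then take the maximum
> sapply(groups, function(x) data.frame(max = max(colSums(the.data.a[x, -1])),
                                        ind = which.max(colSums(the.data.a[x, -1]))))
    [,1] [,2]
max 34   45  
ind 2    2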
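One caveat with the snippet above: the.data.a[x, -1] selects rows by position, so it relies on instrument IDs coinciding with row numbers. A group such as g5, which contains instrument 6 with no data, would pick up an NA row. A possible workaround, sketched here in base R with xtabs() instead of dcast() (that swap is my own choice, and da.mat, present and best are illustrative names), is to look rows up by instrument ID:

```r
the.data <- data.frame(
  instrument = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  hour       = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  da         = c(12, 14, 11, 14, 10, 19, 15, 16, 13, 11)
)
groups <- list(g1 = c(1, 2), g5 = c(5, 3, 1, 2, 6))

# instrument x hour table of summed da; the dimnames carry the IDs and hours
da.mat <- xtabs(da ~ instrument + hour, data = the.data)

best <- lapply(groups, function(ids) {
  # Select rows by instrument ID (as a name), ignoring IDs that have no data
  present <- intersect(as.character(ids), rownames(da.mat))
  by.hour <- colSums(da.mat[present, , drop = FALSE])
  c(max = max(by.hour), hour = as.numeric(names(which.max(by.hour))))
})
```

With the groups as defined here, best$g1 is max 34 at hour 2, and best$g5 is computed from instruments 1, 2, 3 and 5 only.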

score 3 · Accepted Answer

This is a slightly modified version of John Colby's answer, with some sample data.

set.seed(21)
instrument <- sample(100, 1e5, TRUE)
hour <- sample(24, 1e5, TRUE)
da <- trunc(runif(1e5)*10)
db <- trunc(runif(1e5)*10)
the.data <- data.frame(instrument,hour,da,db)
groups <- replicate(5000, sample(100, sample(50,1)))
names(groups) <- paste("g",1:length(groups),sep="")

library(reshape2)
system.time({    
the.data.a <- dcast(the.data[,1:3], instrument ~ hour, sum)
out <- t(sapply(groups, function(i) {
  byHour <- colSums(the.data.a[i,-1])
  c(max(byHour), which.max(byHour))
}))
colnames(out) <- c("max.sum","max.hour")
})
# Message from dcast(): Using da as value column: use value.var to override.
#    user  system elapsed 
#    3.80    0.00    3.81
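The "vectorized algebra" idea can be pushed one step further so that the loop over groups also disappears: a 0/1 membership matrix (groups x instruments) multiplied by the instrument x hour matrix gives every group's sum for every hour at once. The following is a rough sketch under my own assumptions, not a benchmarked replacement. It uses base R only, smaller sample sizes than above, and the names member, sums and n.instruments are hypothetical.

```r
set.seed(21)
n.instruments <- 100
the.data <- data.frame(
  instrument = sample(n.instruments, 1e4, TRUE),
  hour       = sample(24, 1e4, TRUE),
  da         = trunc(runif(1e4) * 10)
)
# simplify = FALSE guarantees a list even if all groups had equal length
groups <- replicate(200, sample(n.instruments, sample(2:50, 1)), simplify = FALSE)
names(groups) <- paste0("g", seq_along(groups))

# instrument x hour matrix of summed da
da.mat <- unclass(xtabs(da ~ instrument + hour, data = the.data))

# 0/1 membership matrix: one row per group, one column per instrument row of da.mat
member <- t(sapply(groups, function(ids) {
  as.numeric(as.integer(rownames(da.mat)) %in% ids)
}))

# (groups x instruments) %*% (instruments x hours) = per-group sum for each hour
sums <- member %*% da.mat

out <- data.frame(
  max.sum  = apply(sums, 1, max),
  max.hour = as.numeric(colnames(sums))[max.col(sums, ties.method = "first")]
)
```

Each row of out corresponds to one group. Whether this is faster than the sapply() version for 5,000 groups would need to be measured; for much larger instrument counts a sparse matrix might be worth considering.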

score 2 · Accepted Answer

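Here is one approach using Hadley's plyr and reshape2. First, add some boolean columns to the.data indicating whether each instrument belongs to a given group. Then melt it into long format, subset out the rows you don't need, and do the grouping operation with either ddply or data.table.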

#add boolean columns
the.data <- transform(the.data, 
                      g1 = instrument %in% g1,
                      g2 = instrument %in% g2,
                      g3 = instrument %in% g3,
                      g4 = instrument %in% g4,
                      g5 = instrument %in% g5
                      )

#load library
library(reshape2)
#melt into long format
the.data.m <- melt(the.data, id.vars = 1:4)
#subset out data that has FALSE for the groupings
the.data.m <- subset(the.data.m, value == TRUE)

#load plyr and data.table
library(plyr)
library(data.table)

#plyr way
ddply(the.data.m, c("variable", "hour"), summarize, out = sum(da))
#data.table way
dt <- data.table(the.data.m)
dt[, list(out = sum(da)), by = c("variable", "hour")]

Run a benchmark to see which is faster:

library(rbenchmark)   
f1 <- function() ddply(the.data.m, c("variable", "hour"), summarize, out = sum(da))
f2 <- function() dt[, list(out = sum(da)), by = c("variable", "hour")]

> benchmark(f1(), f2(), replications=1000, order="elapsed", columns = c("test", "elapsed", "relative"))
  test elapsed relative
2 f2()    3.44 1.000000
1 f1()    6.82 1.982558

So in this example data.table is about 2x faster. Your mileage may vary.

And to show that it gives the right values:

> dt[, list(out = sum(da)), by = c("variable", "hour")]
      variable hour out
 [1,]       g1    1  26
 [2,]       g1    2  34
 [3,]       g2    1  25
 [4,]       g2    2  29

...
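The table above lists a sum for every group and hour; the question also asks for the hour with the largest sum in each group. With around 5,000 groups, one boolean column per group may also become unwieldy. One possible variation, sketched here on the assumption that the groups are kept in a named list, is to build a long membership table and join it to the data. The names membership, joined, sums and best are illustrative and not part of the answer.

```r
library(data.table)

the.data <- data.table(
  instrument = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  hour       = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  da         = c(12, 14, 11, 14, 10, 19, 15, 16, 13, 11),
  db         = c(21, 23, 22, 29, 28, 26, 24, 27, 26, 22)
)
groups <- list(g1 = c(1, 2), g3 = c(1, 5, 2))

# Long membership table: one row per (group, instrument) pair
membership <- rbindlist(lapply(names(groups), function(g) {
  data.table(group = g, instrument = groups[[g]])
}))

# Attach the measurements to each membership row (many-to-many on instrument)
joined <- merge(membership, the.data, by = "instrument", allow.cartesian = TRUE)

# Sum da per group and hour, as in the answer
sums <- joined[, list(out = sum(da)), by = c("group", "hour")]

# Keep only the hour with the largest sum within each group
best <- sums[, .SD[which.max(out)], by = group]
```

For these two groups, best contains hour 2 with out 34 for g1 and hour 2 with out 45 for g3. Repeating the last two steps with db in place of da gives the result for the other data type.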

score 2 · Accepted Answer

You didn't provide your code (or a programmatic way to generate the groups, which you will presumably need if there are 5000 of them), but this may be a more effective use of R:

groups <- list(g1,g2,g3,g4,g5)
gmax <- list()
# The "da" results
for( gitem in seq_along(groups) ) { 
       gmax[[gitem]] <- with( subset(the.data , instrument %in% groups[[gitem]]),  
                               tapply(da , hour, sum) ) }
damat <- matrix(c(sapply(gmax, which.max), 
                  sapply(gmax, max)) , ncol=2)

# The "db" results
for( gitem in seq_along(groups) ) { 
       gmax[[gitem]] <- with( subset(the.data , instrument %in% groups[[gitem]]),  
                               tapply(db , hour, sum) ) }
dbmat <- matrix(c(sapply(gmax, which.max), 
                  sapply(gmax, max)) , ncol=2)

#--------
> damat
     [,1] [,2]
[1,]    2   34
[2,]    2   29
[3,]    2   45
[4,]    1   14
[5,]    2   42
> dbmat
     [,1] [,2]
[1,]    2   50
[2,]    2   53
[3,]    1   72
[4,]    1   29
[5,]    1   73
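The two loops above differ only in the measurement column, so they could be folded into one helper that takes the column name as an argument. This is just a sketch of that refactoring under my own naming (best_hour, value and by.hour are hypothetical), and it reports the hour label rather than the position returned by which.max():

```r
the.data <- data.frame(
  instrument = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  hour       = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  da         = c(12, 14, 11, 14, 10, 19, 15, 16, 13, 11),
  db         = c(21, 23, 22, 29, 28, 26, 24, 27, 26, 22)
)
groups <- list(g1 = c(1, 2), g3 = c(1, 5, 2))

# For each group, return the hour with the largest sum of `value` and that sum
best_hour <- function(data, groups, value) {
  t(sapply(groups, function(g) {
    rows    <- data$instrument %in% g
    by.hour <- tapply(data[[value]][rows], data$hour[rows], sum)
    c(hour = as.numeric(names(by.hour)[which.max(by.hour)]),
      sum  = max(by.hour))
  }))
}

damat <- best_hour(the.data, groups, "da")
dbmat <- best_hour(the.data, groups, "db")
```

Both results are matrices with one row per group and the columns hour and sum. When two hours tie, as they do for g3 and db here, which.max() keeps the first one.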
