r - How can I speed up transposing a data.frame by group?

Question

I have this data.frame, in which every group (id) has the same length:

id  |  amount 
--------------
 A  |   10   
 A  |   54   
 A  |   23   
 B  |   34   
 B  |   76    
 B  |   12

I would like to transpose it by group id into this:

 id |
----------------------
 A  | 10  |  54 | 23  
 B  | 34  |  76 | 12

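What is the most efficient way to do this?

I have used reshape and dcast before, but they are really slow. (I have a lot of data and would like to speed up this bottleneck.)

Is there a better strategy? Using data.table or matrices? Any help would be much appreciated!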

# Little data.frame
df <- data.frame(id=c(2,2,2,5,5,5), amount=as.integer(c(10,54,23,34,76,12)))

# Not so little data.frame
set.seed(10)
df <- data.frame(id = rep(sample(1:10000, 10000, replace = FALSE), 100), amount = as.integer(floor(runif(1000000, -100000, 100000))))

# Create time variable
df$time <- ave(as.numeric(df$id), df$id, FUN = seq_along)

# The base R reshape strategy
system.time(df.reshape <-reshape(df, direction = "wide", idvar="id", timevar="time"))
user  system elapsed 
6.36    0.31    6.69 

# The reshape2 dcast strategy
require(reshape2)
a <- system.time(mm <- melt(df,id.vars=c('id','time'),measure.vars=c('amount')))
b <- system.time(df.dcast <- dcast(mm,id~variable+time,fun.aggregate=mean))
a+b
user  system elapsed 
14.44    0.00   14.45
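Since the question asks about data.table, here is a minimal sketch of that route. It assumes the data.table package is installed and uses its dcast method together with rowid() to number the rows within each id. The small table and the names dt and wide_dt are illustrative only, and no timing is claimed for it here.

```r
library(data.table)

# Small illustrative table, same shape as the example in the question
dt <- data.table(id = c("A", "A", "A", "B", "B", "B"),
                 amount = c(10L, 54L, 23L, 34L, 76L, 12L))

# rowid(id) numbers the rows within each id (1, 2, 3, ...);
# those numbers become the column names of the wide table.
wide_dt <- dcast(dt, id ~ rowid(id), value.var = "amount")
print(wide_dt)
```

Unlike the reshape2 version, this does not need a separate melt step or a precomputed time column.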

Update: Using the fact that each group has the same length, the matrix function can be used.

df.matrix <- data.frame(id = unique(df$id), matrix(df$amount, nrow = length(unique(df$id)), byrow = TRUE))
user  system elapsed 
0.03    0.00    0.03

Note: this method assumes the data.frame is already sorted by id, so that the rows of each group are contiguous.
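If the sort order cannot be guaranteed, the rows can be ordered first. The following is only a sketch: wide_by_group is a hypothetical helper name, and it assumes the columns are called id and amount as in the question. It also stops if the groups are not all the same size, because the matrix trick is only valid in that case.

```r
# Hypothetical helper: long -> wide via matrix(), without assuming sort order
wide_by_group <- function(df, id_col = "id", value_col = "amount") {
  # Order by id; the row index breaks ties so within-group order is preserved
  ord  <- order(df[[id_col]], seq_len(nrow(df)))
  ids  <- df[[id_col]][ord]
  vals <- df[[value_col]][ord]

  # The matrix trick requires every group to have the same number of rows
  sizes <- table(ids)
  stopifnot(length(unique(as.vector(sizes))) == 1L)

  # After ordering, each group is one contiguous block, i.e. one matrix row
  m <- matrix(vals, nrow = length(sizes), byrow = TRUE)
  data.frame(id = unique(ids), m)
}

# ids deliberately interleaved to show that the ordering step matters
small <- data.frame(id = c(5, 2, 5, 2, 5, 2),
                    amount = as.integer(c(34, 10, 76, 54, 12, 23)))
wide <- wide_by_group(small)
print(wide)
```

The ordering step adds some cost on large data, so it is worth skipping when the data is known to be sorted already.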

score 2 · Accepted Answer

With the matrix approach, I would use:

  system.time({ df.reshape <-matrix(df$amount, nrow=10000, byrow=TRUE); 
               rownames(df.reshape)<- df$id[1:10000]
             } )
   user  system elapsed 
  0.010   0.006   0.016
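One thing to check before relying on byrow=TRUE: it places consecutive elements in the same row, so it only groups correctly when the rows of each id are adjacent. The benchmark data in the question is built with rep(sample(...), 100), which repeats the whole block of ids, so rows of the same id are 10000 positions apart. For that layout a column-wise fill lines up with the ids. A small sketch of the idea, using made-up names (ids, df_small) and a tiny analogue of the benchmark data:

```r
# Tiny analogue of the benchmark layout: the block of ids repeats,
# so rows belonging to the same id are NOT adjacent.
ids <- c(7, 3, 9)
df_small <- data.frame(id = rep(ids, 4), amount = 1:12)

# Column-wise fill: each column is one full pass over the ids,
# so each row collects the values of a single id.
m <- matrix(df_small$amount, nrow = length(ids), byrow = FALSE)
rownames(m) <- df_small$id[seq_along(ids)]
print(m)

# Sanity check against a direct subset for one id
stopifnot(identical(m["7", ], df_small$amount[df_small$id == 7]))
```

If the data is sorted by id instead, byrow = TRUE as in the answer is the right choice.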

score 1 · Accepted Answer

Try this:

 dFrame <- data.frame(id = c(rep("A", 3), rep("B", 3)), amount = c(10, 54, 23, 34, 76, 12))
 newFrame <- cbind(data.frame(id = unique(dFrame$id)),
                   matrix(as.numeric(unlist(tapply(dFrame$amount, dFrame$id, identity))),
                          nrow = length(unique(dFrame$id)), byrow = TRUE))

The brackets may be off. I tried to be careful, but I don't have an R interpreter available at the moment.

Benchmark results based on the df sample code you provided:

  replications elapsed relative user.self sys.self user.child sys.child
   1            1   4.193        1     4.056    0.064          0         0
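The tapply-based answer above splits the amounts by id and then stacks the pieces as rows. A shorter base R variant of the same idea, offered only as a sketch, uses split() and rbind(). The names dFrame2 and m2 are made up for illustration, and this has not been benchmarked against the timings above.

```r
dFrame2 <- data.frame(id = c(rep("A", 3), rep("B", 3)),
                      amount = c(10, 54, 23, 34, 76, 12))

# split() returns one vector of amounts per id;
# rbind() stacks them as rows, using the ids as row names.
m2 <- do.call(rbind, split(dFrame2$amount, dFrame2$id))
print(m2)
```

Like the other matrix-based approaches, this relies on every group having the same length.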
