r - 行ごとの data.table 関数が遅すぎる

Question

行ごとに加重平均 (6M 行以上) を計算する必要がありますが、非常に時間がかかります。重みのある列は文字フィールドであるため、weighted.mean を直接使用することはできません。

バックグラウンドデータ：

library(data.table)
library(stringr)
values <- c(1,2,3,4)
grp <- c("a", "a", "b", "b")
weights <- c("{10,0,0,0}", "{0,10,0,0}", "{10,10,0,0}", "{0,0,10,0}")
DF <- data.frame(cbind(grp, weights))
DT <- data.table(DF)

string.weighted.mean <- function(weights.x) {
  tmp.1 <- na.omit(as.numeric(unlist(str_split(string=weights.x, pattern="[^0-9]+"))))
  tmp.2 <- weighted.mean(x=values, w=tmp.1)
}

これは、data.frames を使用して (遅すぎる) 行う方法です。

DF$wm <- mapply(string.weighted.mean, DF$weights)

これは仕事をしますが、遅すぎます（時間）：

DT[, wm:=mapply(string.weighted.mean, weights)]

物事をスピードアップするために最後の行をどのように言い換えることができますか?

score 6 · Accepted Answer

DT[, rowid := 1:nrow(DT)]
setkey(DT, rowid)
DT[, wm :={
    weighted.mean(x=values, w=na.omit(as.numeric(unlist(str_split(string=weights, pattern="[^0-9]+")))))     
}, by=rowid]

score 2 · Accepted Answer

グループは加重平均の計算とは何の関係もないように見えるので、問題を少し単純化しようとしました。

     values <- seq(4)

# A function to compute a string of length 4 with random weights 0 or 10
     tstwts <- function()
     {
         w <- sample( c(0, 10), 4, replace = TRUE )
         paste0( "{", paste(w, collapse = ","), "}" )
     }

# Generate 100K strings and put them into a vector
     u <- replicate( 1e5, tstwts() )
     head(u)   # Check
     table(u)

# Function to compute a weighted mean from a string using values 
# as an assumed external numeric vector 'values' of the same length as
# the weights
    f <- function(x)
         {
             valstr <- gsub( "[\\{\\}]", "", x )
             wts <- as.numeric( unlist( strsplit(valstr, ",") ) )
             sum(wts * values) / sum(wts) 
         }

# Execute the function f recursively on the vector of weights u
    v <- sapply(u, f)

# Some checks:
    head(v)
    table(v)

私のシステムでは、100K の繰り返しに対して、

> system.time(sapply(u, f))
   user  system elapsed 
   3.79    0.00    3.83

これのデータテーブルバージョン (sans groups) は次のようになります。

DT <- data.table( weights = u )
DT[, wt.mean := lapply(weights, f)] )
head(DT)
dim(DT)

私のシステムでは、これには

system.time( DT[, wt.mean := lapply( weights, f )] ) ユーザーシステムの経過時間 3.62 0.03 3.69

そのため、私のものに匹敵するシステム（Win7、2.8GHzデュアルコアチップ、8GB RAM）では、100万観測あたり約35〜40秒になると予想されます。YMMV。

r - 行ごとの data.table 関数が遅すぎる

2 に答える 2

Related

Reference