r - Streamlining a subsampling procedure in R

Question

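I have a dataset of values obtained from studies and experiments. Experiments are nested within studies. I want to subsample the dataset so that each study is represented by only one experiment. I want to repeat this procedure 10,000 times, drawing one experiment at random per study each time, and then compute summary statistics of the values. Here is an example dataset: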

df=data.frame(study=c(1,1,2,2,2,3,4,4),expt=c(1,2,1,2,3,1,1,2),value=runif(8))

I wrote the following function to do this, but it takes forever. Any suggestions for streamlining this code? Thanks!

subsample=function(x,A) {
  subsample.list=sapply(1:A,function(m) {
    idx=ddply(x,c("study"),function(i) sample(1:nrow(i),1)) #Sample one experiment from each study
    x[paste(x$study,x$expt,sep="-") %in% paste(idx$study,idx$V1,sep="-"),"value"] } ) #Match the study-experiment combinations and retrieve values
  means.list=ldply(subsample.list,mean) #Calculate the mean of 'values' for each iteration
  c(quantile(means.list$V1,0.025),mean(means.list$V1),upper=quantile(means.list$V1,0.975)) } #Calculate overall means and 95% CIs

score 1 · Accepted Answer

Here is a base R solution that avoids ddply for speed reasons.

df=data.frame(study=c(1,1,2,2,2,3,4,4),expt=c(1,2,1,2,3,1,1,2),value=runif(8))

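sample.experiments <- function(df) {
    # Assumes rows are sorted so that each study forms one contiguous run
    r <- rle(df$study)
    # Pick one within-study row offset per study
    samp <- sapply( r$lengths , function(x) sample(seq_len(x),1) )
    # Number of rows before each study; head(, -1) also works with a single study
    start.idx <- c(0, head(cumsum(r$lengths), -1))
    df[samp + start.idx,]
}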
> sample.experiments(df)
  study expt     value
1     1    1 0.6113196
4     2    2 0.5026527
6     3    1 0.2803080
7     4    1 0.9824377
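
To get the summary the question asks for, a single draw like the one above can be repeated and the per-draw means summarized. The following is only a minimal sketch, not part of the original answer: `subsample.base` and its argument `A` are hypothetical names, the `set.seed` call is there only to make the sketch reproducible, and it assumes `df` is sorted by `study`, as the `rle` approach requires.

```r
set.seed(1)  # assumption: fixed seed, only so this sketch is reproducible
df <- data.frame(study = c(1, 1, 2, 2, 2, 3, 4, 4),
                 expt  = c(1, 2, 1, 2, 3, 1, 1, 2),
                 value = runif(8))

# One random experiment (row) per study; rows must be sorted by study
sample.experiments <- function(df) {
  r <- rle(df$study)
  samp <- sapply(r$lengths, function(x) sample(seq_len(x), 1))
  start.idx <- c(0, head(cumsum(r$lengths), -1))
  df[samp + start.idx, ]
}

# Hypothetical wrapper: repeat the draw A times, then summarize the A means
subsample.base <- function(df, A) {
  means <- replicate(A, mean(sample.experiments(df)$value))
  c(lower = unname(quantile(means, 0.025)),
    mean  = mean(means),
    upper = unname(quantile(means, 0.975)))
}

one.draw <- sample.experiments(df)
res <- subsample.base(df, 1000)
```

This still calls the sampler once per iteration, so it trades speed for simplicity.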

Benchmark

> m <- microbenchmark(
+   ddply(df,.(study),function(i) i[sample(1:nrow(i),1),]) ,
+   sample.experiments(df)
+   )
> m
Unit: microseconds
                                                        expr      min       lq   median       uq      max
1 ddply(df, .(study), function(i) i[sample(1:nrow(i), 1), ]) 3808.652 3883.632 3936.805 4022.725 6530.506
2                                     sample.experiments(df)  337.327  350.734  357.644  365.915  580.097

[Figure: autoplot of the microbenchmark results]

score 1 · Accepted Answer

You can vectorize this much further along the following lines (even while still using plyr), and it will run far faster.

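yoursummary=function(x) c(quantile(x,0.025),mean(x),upper=quantile(x,0.975))
subsampleX=function(x,M) {
  # One row per study, one column per iteration: all M draws for a study are made at once
  draws=daply(x,.(study),.drop_o=FALSE,
    # Index with sample.int so a study with a single value >= 1 is not treated as 1:value
    function(s) s$value[sample.int(nrow(s),M,replace=TRUE)]
  )
  # Average across studies within each iteration, then summarize the M means
  yoursummary(colMeans(draws))
}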

The trick here is to do all the sampling up front. If you want to sample M times, why not draw all M samples for a study while you already have that study's data in hand?
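
The same idea, making all M draws for a study in one call, can also be sketched in base R without plyr. This is an illustrative sketch under stated assumptions rather than the answerer's code: `subsample.vec` is a hypothetical name, the seed is fixed only for reproducibility, and `sample.int` is used for indexing so that a study with a single value is handled like any other.

```r
set.seed(1)  # assumption: fixed seed, only so this sketch is reproducible
df <- data.frame(study = c(1, 1, 2, 2, 2, 3, 4, 4),
                 expt  = c(1, 2, 1, 2, 3, 1, 1, 2),
                 value = runif(8))

# Hypothetical helper: draw all M picks for each study up front
subsample.vec <- function(df, M) {
  # List with one vector of values per study
  by.study <- split(df$value, df$study)
  # Column j holds M draws (with replacement) from study j
  draws <- vapply(by.study,
                  function(v) v[sample.int(length(v), M, replace = TRUE)],
                  numeric(M))
  # vapply returns a plain vector when M == 1, so force an M-row matrix
  draws <- matrix(draws, nrow = M)
  # Row i is iteration i: one experiment per study, averaged across studies
  means <- rowMeans(draws)
  c(lower = unname(quantile(means, 0.025)),
    mean  = mean(means),
    upper = unname(quantile(means, 0.975)))
}

res.vec <- subsample.vec(df, 10000)
```

Because each study is sampled independently, drawing all of a study's picks at once gives the same distribution of iteration means as drawing one experiment per study inside a loop.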

Original code:

> system.time(subsample(df,20000))
   user  system elapsed 
 123.23    0.06  124.74

New vectorized code:

> system.time(subsampleX(df,20000))
   user  system elapsed 
   0.24    0.00    0.25

That is about 500 times faster.
