r - RCpp は基本的な R 関数の評価を高速化しますか?

Question

申し訳ありませんが、私は Rcpp についてあまり知りませんが、私が書いているパッケージを改善するために Rcpp を学ぶことが良いかどうかを理解しようとしています。

MCMCアルゴリズムを使用して、高次元の制約された空間から均一に効率的かつランダムにサンプリングする（想定されている）Rパッケージを作成しました。これは (未完成で) https://github.com/davidkane9/kmatchingにあります。

問題は、Gelman-Rubin 診断と呼ばれる統計テストを実行して、MCMC アルゴリズムが定常分布に収束したかどうかを確認するときです。統計の R = 1 を取得する必要がありますが、非常に高い数値が得られます。サンプリングは悪いので、誰も使用すべきではありません。解決策は、より多くのサンプルを取得し、さらにスキップすることです (100 個ごとではなく、1000 個のうち 1 個を取得します)。ただし、これには多くの時間がかかります。コードを実行したい場合は、次の例をご覧ください。

##install the package first
data(lalonde)
matchvars = c("age", "educ", "black")
k = kmatch(x = lalonde, weight.var = "treat", match.var = matchvars, n = 1000, skiplength = 1000, chains = 2, verbose = TRUE)

これの Rprof 出力を見ると、それが得られrnorm、%*%ほとんどの時間がかかっています。

                       total.time total.pct self.time self.pct
"kmatch"                  1453.14    100.00      0.00     0.00
"hitandrun"               1450.18     99.79    128.80     8.86
"%*%"                      757.00     52.09    757.00    52.09
"cat"                      343.18     23.62    329.82    22.70
"rnorm"                    106.34      7.32    103.50     7.12
"mirror"                    35.26      2.43     21.84     1.50
"paste"                     14.02      0.96     14.02     0.96
"stdout"                    13.36      0.92     13.36     0.92
"runif"                     13.32      0.92     13.32     0.92
"/"                         12.82      0.88     12.82     0.88
">"                          7.42      0.51      7.42     0.51
"<"                          6.22      0.43      6.22     0.43
"-"                          5.78      0.40      5.78     0.40
"max"                        5.18      0.36      5.18     0.36
"nchar"                      5.12      0.35      5.12     0.35
"*"                          4.84      0.33      4.84     0.33
"min"                        3.94      0.27      3.94     0.27
"sum"                        3.42      0.24      3.42     0.24
"gelman.diag"                2.90      0.20      0.00     0.00
"=="                         2.86      0.20      2.86     0.20
"ncol"                       2.84      0.20      2.84     0.20
"apply"                      2.72      0.19      0.26     0.02
"+"                          2.48      0.17      2.48     0.17
"FUN"                        2.32      0.16      1.66     0.11
"^"                          2.08      0.14      2.08     0.14
":"                          1.24      0.09      1.24     0.09
"sqrt"                       0.96      0.07      0.96     0.07
"%%"                         0.90      0.06      0.90     0.06
"mean.default"               0.62      0.04      0.62     0.04
"lapply"                     0.40      0.03      0.26     0.02
"("                          0.32      0.02      0.32     0.02
"unlist"                     0.26      0.02      0.00     0.00
"array"                      0.12      0.01      0.02     0.00
"sapply"                     0.12      0.01      0.00     0.00
"matrix"                     0.06      0.00      0.02     0.00
"Null"                       0.04      0.00      0.04     0.00
"print"                      0.04      0.00      0.00     0.00
"unique"                     0.04      0.00      0.00     0.00
"abs"                        0.02      0.00      0.02     0.00
"all"                        0.02      0.00      0.02     0.00
"aperm.default"              0.02      0.00      0.02     0.00
"as.matrix.mcmc"             0.02      0.00      0.02     0.00
"file.exists"                0.02      0.00      0.02     0.00
"list"                       0.02      0.00      0.02     0.00
"print.default"              0.02      0.00      0.02     0.00
"stopifnot"                  0.02      0.00      0.02     0.00
"unique.default"             0.02      0.00      0.02     0.00
"which.min"                  0.02      0.00      0.02     0.00
"<Anonymous>"                0.02      0.00      0.00     0.00
"aperm"                      0.02      0.00      0.00     0.00
"as.mcmc.list"               0.02      0.00      0.00     0.00
"as.mcmc.list.default"       0.02      0.00      0.00     0.00
"data"                       0.02      0.00      0.00     0.00
"mcmc.list"                  0.02      0.00      0.00     0.00
"print.gelman.diag"          0.02      0.00      0.00     0.00
"quantile.default"           0.02      0.00      0.00     0.00
"sort"                       0.02      0.00      0.00     0.00
"sort.default"               0.02      0.00      0.00     0.00
"sort.int"                   0.02      0.00      0.00     0.00
"summary"                    0.02      0.00      0.00     0.00
"summary.default"            0.02      0.00      0.00     0.00

verbose = F に設定すると、エラーcatはなくなり%*%ますが、約 70% の時間がかかります。コードを C++ で記述してから RCpp を使用する価値があるかどうか、または非常に時間がかかる関数が基本的な関数 (既に C で記述されている) であるため、価値がないのではないかと考えています。それと一緒に暮らすか、より良いアルゴリズムを見つける必要があります.

編集：Rprofによると、私を支えている1行は次のとおりu = Z %*% rですhitandrun：

## This is the loop that is being run millions of times and taking forever
for(i in 1:(n*skiplength+discard)) {
        tmin<-0;tmax<-0;
        ## runs counts how many times tried to pick a direction, if
        ## too high fail.
        runs = 0
        while(tmin ==0 && tmax ==0) {
          ## r is a random unit vector in with basis in Z
          r <- rnorm(ncol(Z))
          r <- r/sqrt(sum(r^2))

          ## u is a unit vector in the appropriate k-plane pointing in a
          ## random direction Z %*% r is the same as in mirror
          u <- Z%*%r
          c <- y/u
          ## determine intersections of x + t*u with walls
          ## the limits on how far you can go backward and forward
          ## i.e. the maximum and minimum ratio y_i/u_i for negative and positive u.
          tmin <- max(-c[u>0]); tmax <- min(-c[u<0]);
          ## unboundedness
          if(tmin == -Inf || tmax == Inf){
            stop("problem is unbounded")
          }
          ## if stuck on boundary point
          if(tmin==0 && tmax ==0) {
            runs = runs + 1
            if(runs >= 1000) stop("hitandrun found can't find feasible direction, cannot generate points")
          }
        }

        ## chose a point on the line segment
        y <- y + (tmin + (tmax - tmin)*runif(1))*u;

        ## choose a point every 'skiplength' samples
        if(i %% skiplength == 0) {
          X[,index] <- y
          index <- index + 1
        }
        if(verbose) for(j in 1:nchar(str)) cat("\b")
        str <- paste(i)
        if(verbose) cat(str)
      }

実際には、サンプリングループで行列乗算を行うのはこれだけですが、サンプルごとに 1 回、100 万個のサンプルを取得して 99% を捨てるという、何千回も実行します。

score 2 · Accepted Answer

実際、Rcpp はまさにこの目的のために多く使用されてきました: MCMC. 通常、30 から 50 または 70 のオーダーで、かなりまともな速度の向上が得られます。

初期のパッケージの 1 つに Whit のrcppbugsがあります。Whit は、彼が作成したいくつかのクラスでプログラミングした後、一般的な使いやすさのために Rcpp に変換しました。「Rcpp MCMC」を気軽に Web 検索すると、いくつかの投稿が表示されます。

また、他の作成者もこれに Rcpp を使用しています。また、(R)Stan の内部にもあります。これは、MCMC に固有のループ構造を可能な限り高速に実行したいからです。したがって、コンパイルされました。

先週、rcpp-devel メーリングリストに、明日行う R ユーザーグループの簡単なプレゼンテーションで何を話すべきか尋ねたところ、多かれ少なかれ「MCMC」の提案が優勢でした。別の RUG からの 1 つの完全な講演も発表されました。スレッドにリンクしたいのですが、どういうわけか、rcpp-devel の Gmane のアーカイブから削除されました。

要するに、はい、ここで Rcpp の使用を検討したいと思います。

r - RCpp は基本的な R 関数の評価を高速化しますか?

1 に答える 1

Related

Reference