r - 切断正規分布からの効率的な乱数生成

Question

平均=0およびsd-1の正規分布から50,000個の値をサンプリングしたいと思います。ただし、値を[-3,3]に制限したいと思います。私はこれを行うためのコードを書きましたが、それが最も効率的かどうかわかりませんか？いくつかの提案を得ることを望んでいました。

lower <- -3 
upper <- 3
x_norm<-rnorm(75000,0,1)
x_norm<-x_norm[which(x_norm >=lower & x_norm<=upper)]
repeat{
    x_norm<-c(x_norm, rnorm(10000,0,1))
    x_norm<-x_norm[which(x_norm >=lower & x_norm<=upper)]
    if(length(x_norm) >= 50000){break}
}
x_norm<-x_norm[1:50000]

score 15 · Accepted Answer

Something like your code will most definitely work but you're greatly overestimating how many values you need. Given that it's a known distribution and a fairly large number of samples you know how many will appear more or less than 3.

(1-pnorm(3))*2 * 50000
[1] 134.9898

したがって、50,000のドローで範囲外になる可能性が高いのは約135であるため、それよりも少し多くドローするのは非常に簡単ですが、それでも過度に大きな数を描画してトリミングすることはできません。3未満または3より大きい50,500の最初の50,000を取得します。

x <- rnorm(50500)
x <- x[x < 3 & x > -3]
x <- x[1:50000]

最初の2行を40,000回実行しましたが、毎回50000を超える長さが返されました。小さなブールチェックは、常に実行することを保証できます。

x <- 1
while (length(x) < 50000){
    x <- rnorm(50500)
    x <- x[x < 3 & x > -3]}
x <- x[1:50000]

私にとって、これは6ミリ秒でほぼ100％の時間実行されます。これは、Rで実行する簡単な方法であり、非常に高速に実行され、読みやすく、アドオンを必要としません。

score 11 · Accepted Answer

効率を本当に気にするなら、この短いRcppコードを打ち負かすのは難しいでしょう。以下をファイルに保存します/tmp/rnormClamp.cpp。

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
NumericVector rnormClamp(int N, int mi, int ma) {
    NumericVector X = rnorm(N, 0, 1);
    return clamp(mi, X, ma);
}

/*** R
  system.time(X <- rnormClamp(50000, -3, 3))
  summary(X)
*/

sourceCpp()（Rcppからも）を使用してビルドおよび実行します。私のコンピューターでは、実際の描画とクランプには約4ミリ秒かかります。

R> sourceCpp("/tmp/rnormClamp.cpp")

R>   system.time(X <- rnormClamp(50000, -3, 3))
   user  system elapsed 
  0.004   0.000   0.004 

R>   summary(X)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-3.00000 -0.67300 -0.00528  0.00122  0.68500  3.00000 
R>

clamp()砂糖関数は、Romainによるこの以前のSO回答で取り上げられました。これは、Rcppのバージョン0.10.2が必要であることも示しています。

Edit: Per Ben's hint, I seemed to have misunderstood. Here is a mix of C++ and R:

// [[Rcpp::export]]
List rnormSelect(int N, int mi, int ma) {
  RNGScope scope;
  int N2 = N * 1.25;
  NumericVector X = rnorm(N2, 0, 1);
  LogicalVector ind = (X < mi) | (X > ma);
  return List::create(X, ind);
}

which one can append to the earlier file. Then:

R>   system.time({ Z <- rnormSelect(50000, -3, 3); 
+                  X <- Z[[1]][ ! Z[[2]] ]; X <- X[1:50000]})
   user  system elapsed 
  0.008   0.000   0.009 

R>   summary(X)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-3.00000 -0.68200 -0.00066 -0.00276  0.66800  3.00000 
R>

I'll revisit to the logical indexing and the row subset which I'll have to look up. Maybe tomorrow. But 9 milliseconds is still not too bad :)

Edit 2: Looks like we really don't have logical indexing. We'll have to add this. This version does it 'by hand' but is not that much faster than indexing from R:

// [[Rcpp::export]]
NumericVector rnormSelect2(int N, int mi, int ma) {
  RNGScope scope;
  int N2 = N * 1.25;
  NumericVector X = rnorm(N2, 0, 1);
  LogicalVector ind = (X >= mi) & (X <= ma);
  NumericVector Y(N);
  int k=0;
  for (int i=0; i<N2 & k<N; i++) {
    if (ind[i]) Y(k++) = X(i);
  }
  return Y;
}

And the output:

R>   system.time(X <- rnormSelect2(50000, -3, 3)) 
   user  system elapsed 
  0.004   0.000   0.007 

R>   summary(X)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-2.99000 -0.66900 -0.00258  0.00223  0.66700  2.99000 

R>   length(X)
[1] 50000
R>

score 10 · Accepted Answer

JohnとDirkは、棄却サンプリングの良い例を示しました。これは、与えられた質問には問題ないはずです。しかし、別のアプローチを与えるために、累積分布関数とその逆関数（またはその逆近似）がある場合、一様分布からデータを生成して変換することができます。

x <- qnorm( runif(50000, pnorm(-3), pnorm(3)) )
range(x)
hist(x)

与えられた質問に対して、これが棄却サンプリング法よりもはるかに優れているとは思いませんが（切断された正規分布の0.1から2から3の間のデータを生成したい場合は、この方法はおそらくはるかに優れています）もっと効率的。これは、累積とその逆（この場合はpnormとqnorm）に依存するため、簡単に利用できるものがない分布の棄却サンプリングほど単純ではありません。

r - 切断正規分布からの効率的な乱数生成

3 に答える 3

Related

Reference