algorithm - 置換なしのサンプリングのアルゴリズム?

Question

特定のデータのクラスタリングが偶然に発生した可能性をテストしようとしています。これを行う堅牢な方法はモンテカルロシミュレーションです。このシミュレーションでは、データとグループの間の関連付けが無作為に何度も (たとえば 10,000 回) 再割り当てされ、クラスタリングのメトリックを使用して実際のデータとシミュレーションを比較して ap を決定します。価値。

グループ化をデータ要素にマッピングするポインターを使用して、このほとんどの作業を行ったので、データへのポインターをランダムに再割り当てする予定です。質問: 複製データセットですべてのポインターがランダムに再割り当てされるように、置換なしでサンプリングする高速な方法は何ですか?

例 (これらのデータは単純化された例です):

データ (n=12 値) - グループ A: 0.1, 0.2, 0.4 / グループ B: 0.5, 0.6, 0.8 / グループ C: 0.4, 0.5 / グループ D: 0.2, 0.2, 0.3, 0.5

レプリケートデータセットごとに、同じクラスターサイズ (A=3、B=3、C=2、D=4) とデータ値を使用しますが、値をクラスターに再割り当てします。

これを行うには、1 ～ 12 の範囲で乱数を生成し、グループ A の最初の要素を割り当て、次に 1 ～ 11 の範囲で乱数を生成し、グループ A の 2 番目の要素を割り当てる、というようにします。ポインターの再割り当ては高速で、すべてのデータ構造を事前に割り当てますが、置換なしのサンプリングは以前に何度も解決された可能性がある問題のようです。

ロジックまたは疑似コードが優先されます。

score 40 · Accepted Answer

Knuth の本 Seminumeric Algorithms の Algorithm 3.4.2S に基づいた置換なしのサンプリングのコードを次に示します。

void SampleWithoutReplacement
(
    int populationSize,    // size of set sampling from
    int sampleSize,        // size of each sample
    vector<int> & samples  // output, zero-offset indicies to selected items
)
{
    // Use Knuth's variable names
    int& n = sampleSize;
    int& N = populationSize;

    int t = 0; // total input records dealt with
    int m = 0; // number of items selected so far
    double u;

    while (m < n)
    {
        u = GetUniform(); // call a uniform(0,1) random number generator

        if ( (N - t)*u >= n - m )
        {
            t++;
        }
        else
        {
            samples[m] = t;
            t++; m++;
        }
    }
}

Jeffrey Scott Vitter による、「逐次ランダムサンプリングの効率的なアルゴリズム」、ACM Transactions on Mathematical Software、13(1)、1987 年 3 月、58-67 に、より効率的であるがより複雑な方法があります。

score 6 · Accepted Answer

この質問に対する私の回答を参照してくださいO(1) の一意の (繰り返しのない) 乱数? . 同じロジックで、目的を達成する必要があります。

score 2 · Accepted Answer

@John D. Cookの回答に触発されて、Nimで実装を書きました。最初は仕組みがわかりにくかったので、例も含めて幅広くコメントしました。考え方を理解するのに役立つかもしれません。また、変数名を少し変更しました。

iterator uniqueRandomValuesBelow*(N, M: int) =
  ## Returns a total of M unique random values i with 0 <= i < N
  ## These indices can be used to construct e.g. a random sample without replacement
  assert(M <= N)

  var t = 0 # total input records dealt with
  var m = 0 # number of items selected so far

  while (m < M):
    let u = random(1.0) # call a uniform(0,1) random number generator

    # meaning of the following terms:
    # (N - t) is the total number of remaining draws left (initially just N)
    # (M - m) is the number how many of these remaining draw must be positive (initially just M)
    # => Probability for next draw = (M-m) / (N-t)
    #    i.e.: (required positive draws left) / (total draw left)
    #
    # This is implemented by the inequality expression below:
    # - the larger (M-m), the larger the probability of a positive draw
    # - for (N-t) == (M-m), the term on the left is always smaller => we will draw 100%
    # - for (N-t) >> (M-m), we must get a very small u
    #
    # example: (N-t) = 7, (M-m) = 5
    # => we draw the next with prob 5/7
    #    lets assume the draw fails
    # => t += 1 => (N-t) = 6
    # => we draw the next with prob 5/6
    #    lets assume the draw succeeds
    # => t += 1, m += 1 => (N-t) = 5, (M-m) = 4
    # => we draw the next with prob 4/5
    #    lets assume the draw fails
    # => t += 1 => (N-t) = 4
    # => we draw the next with prob 4/4, i.e.,
    #    we will draw with certainty from now on
    #    (in the next steps we get prob 3/3, 2/2, ...)
    if (N - t)*u >= (M - m).toFloat: # this is essentially a draw with P = (M-m) / (N-t)
      # no draw -- happens mainly for (N-t) >> (M-m) and/or high u
      t += 1
    else:
      # draw t -- happens when (M-m) gets large and/or low u
      yield t # this is where we output an index, can be used to sample
      t += 1
      m += 1

# example use
for i in uniqueRandomValuesBelow(100, 5):
  echo i

score 0 · Accepted Answer

置換なしのサンプリングの別のアルゴリズムについては、こちらで説明しています。

これは、John D. Cook の回答と Knuth の回答に似ていますが、仮説が異なります。母集団のサイズは不明ですが、サンプルはメモリに収まる可能性があります。これは「クヌースのアルゴリズム S」と呼ばれます。

ロゼッタコードの記事を引用：

最初の n 個の項目が利用可能になったときにサンプルとして選択します。

i > n の i 番目のアイテムについて、n/i の確率でそれを保持します。この機会に失敗した場合、サンプルは同じままです。そうでない場合は、サンプルの以前に選択された n 個のアイテムの 1 つをランダムに (1/n) 置き換えます。

後続の項目については #2 を繰り返します。

algorithm - 置換なしのサンプリングのアルゴリズム?

7 に答える 7

Related

Reference