algorithm - 指定された結果の確率を持つ値のランダムサンプル

Question

'a '、'b'、'c'、'd' の4 つの記号があるとします。また、関数出力に現れるこれらのシンボルの 4 つの確率(合計は1に等しい) - P1、P2、P3、P4があります。これらのシンボルの 3 つのランダムサンプルを生成する関数をどのように実装しますか? 結果のシンボルは、指定された確率でその中に存在します。

例: 'a'、'b'、'c'および'd'の確率は、それぞれ9/30、8/30、7/30および6/30です。この関数は、これらの 4 つのシンボルのうち任意の 3 つのさまざまなランダムサンプルを出力します: 'abc'、'dca'、'bad'など。この関数を何度も実行し、出力で各シンボルが出現する回数を数えます。最後に、'a'に保存されたカウント値をシンボル出力の合計量で割った値は、 'b' の9/30に収束するはずです。、「c」の場合は7/30まで、「d」の場合は6/30まで。

たとえば、関数は 10 個の出力を生成します。

adc
dab
bca
dab
dba
cab
dcb
acd
cab
abc

30 個のシンボルのうち、 'a 'が 9 個、 'b'が 8 個、'c'が 7個、 'd'が 6個含まれています。もちろん、サンプル数がはるかに多い場合にのみ値が収束するため、これは理想的な例ですが、うまくいけばアイデアを伝えるはずです。

明らかに、これはすべて、どちらの確率も 1/3 より大きくない場合にのみ可能です。これは、各サンプル出力には常に 3 つの異なるシンボルが含まれるためです。指定された値を満たすことが不可能な場合、関数が無限ループに入ったり、その他の方法で異常な動作をしたりしても問題ありません。

注: 関数は明らかに RNG を使用する必要がありますが、それ以外の場合はステートレスにする必要があります。RNG 状態を除いて、それぞれの新しい呼び出しは、以前のものから独立している必要があります。

EDIT : 説明では 4 つの値から 3 つを選択することが言及されていますが、理想的には、アルゴリズムは任意のサンプルサイズに対応できる必要があります。

score 1 · Accepted Answer

これはかなり興味深い問題だと思います。一般的な解はわかりませんが、サイズ n-1 のサンプルの場合 (解がある場合) は簡単に解くことができます。正確に n 個の可能なサンプルがあり、それぞれが 1 個の欠落に対応するからです。要素の。

OP のように、サイズ 4 のユニバースからサイズ 3 のサンプルで F a = 9/30、F b = 8/30、F c = 7/30、F d = 6/30を_求めて_いると_します_。特定のオブジェクトを含まないサンプルを選択することで、これらの各周波数をサンプルの周波数に直接変換できます。たとえば、選択したオブジェクトの 9/30 をにします。サンプルに複数のシンボルを持つことはできず、サンプルには常に 3 つのシンボルがあります。したがって、サンプルの 9/10 にはが含まれている必要があり、1/10 には含まれていません。ただし、 :を含まない可能性のあるサンプルは 1 つだけです。したがって、サンプルの 10% はでなければなりません。同様に、20% は; 30%aaaaabcdbcdacdabdと 40% abc。(または、より一般的には、F _ā = 1 - (n-1)F _aここで、F _āはを含まない (一意の) サンプルの頻度ですa)

この観察と、ユニークなサンプルを生成する古典的な方法の 1 つを組み合わせることで、一般的な問題を解決できると思わずにはいられません。しかし、私はその解決策を持っていません。価値があるのは、私が考えているアルゴリズムは次のとおりです。

To select a random sample of size k out of a universe U of n objects:
1) Set needed = k; available = n.
2) For each element in U, select a random number in the range [0, 1).
3) If the random number is less than k/n:
     3a) Add the element to the sample.
     3b) Decrement needed by 1. If it reaches 0, we're finished.
4) Decrement available, and continue with the next element in U.

したがって、ステップ3でしきい値を変更することにより、要素の周波数を操作して、対応する要素の目的の周波数の関数にすることができるはずだというのが私の考えです。

score 1 · Accepted Answer

あなたの問題は過小評価されています。

p(abc)、p(abd)、p(acd) など xtc を許可する 3 文字の各文字列に確率を割り当てると、一連の方程式を生成できます。

eqn1: p(abc) + p(abd) + ... others with a "a" ... = p1
  ...
  ...
eqn2: p(abd) + p(acd) + ... others with a "d" ... = p4

これには方程式よりも多くの未知数があり、それを解く方法はたくさんあります。解決策が見つかったら、どのような方法を選択しても (私の場合はシンプレックスアルゴリズムを使用します)、@alestanis が説明するルーレット方法を使用して、各文字列の確率からサンプリングします。

from numpy import *

# using cvxopt-1.1.5
from cvxopt import matrix, solvers 

###########################
# Functions to do some parts

# function to find all valid outputs
def perms(alphabet, length):
    if length == 0:
        yield ""
        return
    for i in range(len(alphabet)):
        val1 = alphabet[i]
        for val2 in perms(alphabet[:i]+alphabet[i+1:], length-1):
            yield val1 + val2


# roulette sampler
def roulette_sampler(values, probs):
    # Create cumulative prob distro
    probs_cum = [sum(probs[:i+1]) for i in range(n_strings)]
    def fun():
        r = random.rand()
        for p,s in zip(probs_cum, values):
            if r < p:
                return s
        # in case of rounding error
        return values[-1]
    return fun


############################
#    Main Part



# create list of all valid strings

alphabet = "abcd"
string_length = 3
alpha_probs = [string_length*x/30. for x in range(9,5,-1)]

# show probs
for a,p in zip(alphabet, alpha_probs):
    print "p("+a+") =",p




# all valid outputs for this particular case
strings = [perm for perm in perms(alphabet, string_length)]
n_strings = len(strings)

# constraints from probabilities p(abc) + p(abd) ... = p(a)
contains = array([[1. if s.find(a) >= 0 else 0. for a in alphabet] for s in strings])
#both = concatenate((contains,wons), axis=1).T # hacky, but whatever
#A = matrix(both)
#b = matrix(alpha_probs + [1.])
A = matrix(contains.T)
b = matrix(alpha_probs)

#also need to constrain to [0,1]
wons = array([[1. for s in strings]])
G = matrix(concatenate((eye(n_strings),wons,-eye(n_strings),-wons)))
h = matrix(concatenate((ones(n_strings+1),zeros(n_strings+1))))

## target matricies for approx KL divergence
# uniform prob over valid outputs
u = 1./len(strings)
P = matrix(eye(n_strings))
q = -0.5*u*matrix(ones(n_strings))
# will minimise p^2 - pq for each p val equally


# Do convex optimisation
sol = solvers.qp(P,q,G,h,A,b)
probs = array(sol['x'])

# Print ouput
for s,p in zip(strings,probs):
    print "p("+s+") =",p
checkprobs = [0. for char in alphabet]
for a,i in zip(alphabet, range(len(alphabet))):
    for s,p in zip(strings,probs):
        if s.find(a) > -1:
            checkprobs[i] += p
    print "p("+a+") =",checkprobs[i]
print "total =",sum(probs)

# Create the sampling function
rndstring = roulette_sampler(strings, probs)


###################
# Verify

print "sampling..."
test_n = 1000
output = [rndstring() for i in xrange(test_n)]

# find which one it is
sampled_freqs = []
for char in alphabet:
    n = 0
    for val in output:
        if val.find(char) > -1:
            n += 1
    sampled_freqs += [n]

print "plotting histogram..."
import matplotlib.pyplot as plt
plt.bar(range(0,len(alphabet)),array(sampled_freqs)/float(test_n), width=0.5)
plt.show()

編集: Python コード

score 1 · Accepted Answer

単語の長さは常に記号の数より 1 少ないと仮定すると、次の C# コードがその役割を果たします。

using System;
using System.Collections.Generic;
using System.Linq;
using MathNet.Numerics.Distributions;

namespace RandomSymbols
{
    class Program
    {
        static void Main(string[] args)
        {
            // Sample case:  Four symbols with the following distribution, and 10000 trials
            double[] distribution = { 9.0/30, 8.0/30, 7.0/30, 6.0/30 };
            int trials = 10000;

            // Create an array containing all of the symbols
            char[] symbols = Enumerable.Range('a', distribution.Length).Select(s => (char)s).ToArray();

            // We're assuming that the word length is always one less than the number of symbols
            int wordLength = symbols.Length - 1;

            // Calculate the probability of each symbol being excluded from a given word
            double[] excludeDistribution = Array.ConvertAll(distribution, p => 1 - p * wordLength);

            // Create a random variable using the MathNet.Numerics library
            var randVar = new Categorical(excludeDistribution);
            var random = new Random();
            randVar.RandomSource = random;

            // We'll store all of the words in an array
            string[] words = new string[trials];

            for (int t = 0; t < trials; t++)
            {
                // Start with a word containing all of the symbols
                var word = new List<char>(symbols);

                // Remove one of the symbols
                word.RemoveAt(randVar.Sample());

                // Randomly permute the remainder
                for (int i = 0; i < wordLength; i++)
                {
                    int swapIndex = random.Next(wordLength);
                    char temp = word[swapIndex];
                    word[swapIndex] = word[i];
                    word[i] = temp;
                }

                // Store the word
                words[t] = new string(word.ToArray());
            }

            // Display words
            Array.ForEach(words, w => Console.WriteLine(w));

            // Display histogram
            Array.ForEach(symbols, s => Console.WriteLine("{0}: {1}", s, words.Count(w => w.Contains(s))));
        }

    }
}

更新: 以下は、rici が概説したメソッドの C 実装です。トリッキーな部分は、彼が言及しているしきい値を計算することです。これは私が再帰で行いました。

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// ****** Change the following for different symbol distributions, word lengths, and number of trials ******
double targetFreqs[] = {10.0/43, 9.0/43, 8.0/43, 7.0/43, 6.0/43, 2.0/43, 1.0/43 };
const int WORDLENGTH = 4;
const int TRIALS = 1000000;
// *********************************************************************************************************

const int SYMBOLCOUNT = sizeof(targetFreqs) / sizeof(double);
double inclusionProbs[SYMBOLCOUNT];

double probLeftToIncludeTable[SYMBOLCOUNT][SYMBOLCOUNT];

// Calculates the probability that there will be n symbols left to be included when we get to the ith symbol.
double probLeftToInclude(int i, int n)
{
    if (probLeftToIncludeTable[i][n] == -1)
    {
        // If this is the first symbol, then the number of symbols left to be included is simply the word length.
        if (i == 0)
        {
            probLeftToIncludeTable[i][n] = (n == WORDLENGTH ? 1.0 : 0.0);
        }
        else
        {
            // Calculate the probability based on the previous symbol's probabilities.
            // To get to a point where there are n symbols left to be included, either there were n+1 symbols left
            // when we were considering that previous symbol and we included it, leaving n,
            // or there were n symbols left and we didn't included it, also leaving n.
            // We have to take into account that the previous symbol may have been manditorily included.
             probLeftToIncludeTable[i][n] = probLeftToInclude(i-1, n+1) * (n == SYMBOLCOUNT-i ? 1.0 : inclusionProbs[i-1])
                + probLeftToInclude(i-1, n) * (n == 0 ? 1.0 : 1 - inclusionProbs[i-1]);
        }
    }
    return probLeftToIncludeTable[i][n];
}

// Calculates the probability that the ith symbol won't *have* to be included or *have* to be excluded.
double probInclusionIsOptional(int i)
{
    // The probability that inclusion is optional is equal to 1.0
    // minus the probability that none of the remaining symbols can be included
    // minus the probability that all of the remaining symbols must be included.
    return 1.0 - probLeftToInclude(i, 0) - probLeftToInclude(i, SYMBOLCOUNT - i);
}

// Calculates the probability with which the ith symbol should be included, assuming that
// it doesn't *have* to be included or *have* to be excluded.
double inclusionProb(int i)
{
    // The following is derived by simple algebra:
    // Unconditional probability = (1.0 * probability that it must be included) + (inclusionProb * probability that inclusion is optional)
    // therefore...
    // inclusionProb = (Unconditional probability - probability that it must be included) / probability that inclusion is optional
    return (targetFreqs[i]*WORDLENGTH - probLeftToInclude(i, SYMBOLCOUNT - i)) / probInclusionIsOptional(i);
}

int main(int argc, char* argv[])
{
    srand(time(NULL));

    // Initialize inclusion probabilities
    for (int i=0; i<SYMBOLCOUNT; i++)
        for (int j=0; j<SYMBOLCOUNT; j++)
            probLeftToIncludeTable[i][j] = -1.0;

    // Calculate inclusion probabilities
    for (int i=0; i<SYMBOLCOUNT; i++)
    {
        inclusionProbs[i] = inclusionProb(i);
    }

    // Histogram
    int histogram[SYMBOLCOUNT];
    for (int i=0; i<SYMBOLCOUNT; i++)
    {
        histogram[i] = 0;
    }

    // Scratchpad for building our words
    char word[WORDLENGTH+1];
    word[WORDLENGTH] = '\0';

    // Run trials
    for (int t=0; t<TRIALS; t++)
    {
        int included = 0;

        // Build the word by including or excluding symbols according to the problem constraints
        // and the probabilities in inclusionProbs[].
        for (int i=0; i<SYMBOLCOUNT && included<WORDLENGTH; i++)
        {
            if (SYMBOLCOUNT - i == WORDLENGTH - included // if we have to include this symbol
                || (double)rand()/(double)RAND_MAX < inclusionProbs[i]) // or if we get a lucky roll of the dice
            {
                word[included++] = 'a' + i;
                histogram[i]++;
            }
        }

        // Randomly permute the word
        for (int i=0; i<WORDLENGTH; i++)
        {
            int swapee = rand() % WORDLENGTH;
            char temp = word[swapee];
            word[swapee] = word[i];
            word[i] = temp;
        }

        // Uncomment the following to show each word
        // printf("%s\r\n", word);
    }

    // Show the histogram
    for (int i=0; i<SYMBOLCOUNT; i++)
    {
        printf("%c: target=%d, actual=%d\r\n", 'a'+i, (int)(targetFreqs[i]*WORDLENGTH*TRIALS), histogram[i]);
    }

    return 0;
}

score 0 · Accepted Answer

これを行うには、確率の累積合計を格納する一時配列を使用する必要があります。

あなたの例では、確率はそれぞれ 9/30、8/30、7/30、6/30 です。次に、配列が必要です。

values = {'a', 'b', 'c', 'd'}
proba = {9/30, 17/30, 24/30, 1}

次に、乱数を選択しrて次の[0, 1]ようにします。

char chooseRandom() {
    int i = 0;
    while (r > proba[i])
        ++i; 

    return values[i];
}

algorithm - 指定された結果の確率を持つ値のランダム サンプル

4 に答える 4

Related

Reference

algorithm - 指定された結果の確率を持つ値のランダムサンプル