python - 非常に大きな数のPythonでの二項検定

Question

10000のオーダーの「n」個の計算を可能にするPythonで二項検定を実行する必要があります。

scipy.misc.combを使用して簡単なbinomial_test関数を実装しましたが、階乗または組み合わせ自体を計算しているときに表現可能な最大数に達するため、n=1000付近にかなり制限されていると思います。これが私の関数です：

from scipy.misc import comb
def binomial_test(n, k):
    """Calculate binomial probability
    """
    p = comb(n, k) * 0.5**k * 0.5**(n-k)
    return p

その二項確率を計算するために、ネイティブのpython（またはnumpy、scipy ...）関数をどのように使用できますか？可能であれば、scipy0.7.2互換のコードが必要です。

どうもありがとう！

score 10 · Accepted Answer

このコメントを追加するために編集：Daniel Stutzbachが言及しているように、「二項検定」はおそらく元の投稿者が求めていたものではないことに注意してください（彼はこの表現を使用しましたが）。彼は二項分布の確率密度関数を求めているようですが、これは私が以下で提案しているものではありません。

scipy.stats.binom_testを試しましたか？

rbp@apfelstrudel ~$ python
Python 2.6.2 (r262:71600, Apr 16 2009, 09:17:39) 
[GCC 4.0.1 (Apple Computer, Inc. build 5250)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from scipy import stats
>>> print stats.binom_test.__doc__

    Perform a test that the probability of success is p.

    This is an exact, two-sided test of the null hypothesis
    that the probability of success in a Bernoulli experiment
    is `p`.

    Parameters
    ----------
    x : integer or array_like
        the number of successes, or if x has length 2, it is the
        number of successes and the number of failures.
    n : integer
        the number of trials.  This is ignored if x gives both the
        number of successes and failures
    p : float, optional
        The hypothesized probability of success.  0 <= p <= 1. The
        default value is p = 0.5

    Returns
    -------
    p-value : float
        The p-value of the hypothesis test

    References
    ----------
    .. [1] http://en.wikipedia.org/wiki/Binomial_test


>>> stats.binom_test(500, 10000)
4.9406564584124654e-324

ドキュメントリンクを追加するための小さな編集：http：//docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binom_test.html#scipy.stats.binom_test

ところで：scipy 0.7.2と、現在の0.8devで動作します。

score 6 · Accepted Answer

のように見えるソリューションcomb(n, k) * 0.5**k * 0.5**(n-k)は、大規模では機能しませんn。ほとんどの（すべて？）プラットフォームでは、Pythonfloatが格納できる最小値は約2**-1022です。が大きいn-kまたは大きいk場合、右側は0に丸められます。同様に、comb（n、k）は非常に大きくなり、浮動小数点数に収まらない可能性があります。

より堅牢なアプローチは、累積分布関数の2つの連続する点の差として確率密度関数を計算することです。これは、正規化された不完全ベータ関数を使用して計算できます（SciPyの「特殊関数」パッケージを参照）。数学的に：

pdf(p, n, k) = cdf(p, n, k) - cdf(p, n, k-1)

もう1つのオプションは、正規近似を使用することです。これは、大きい場合に非常に正確ですn。速度が懸念される場合は、これがおそらく進むべき道です。

from math import *

def normal_pdf(x, m, v):
    return 1.0/sqrt(2*pi*v) * exp(-(x-m)**2/(2*v))

def binomial_pdf(p, n, k):
    if n < 100:
        return comb(n, k) * p**k * p**(n-k)  # Fall back to your current method
    return normal_pdf(k, n*p, n*p*(1.0-p))

私はコードをテストしていませんが、それはあなたに一般的な考えを与えるはずです。

score 3 · Accepted Answer

GMPYは、拡張精度の浮動小数点計算もサポートしています。例えば：

>>> from gmpy import *
>>>
>>> def f(n,k,p,prec=256):
...     return mpf(comb(n,k),prec) * mpf(p,prec)**k * mpf(1-p,prec)**(n-k)
...
>>> print(f(1000,500,0.5))
0.0252250181783608019068416887621024545529410193921696384762532089115753731615931
>>>

256ビットの浮動小数点精度を指定しました。ちなみに、SourceForgeのバージョンはかなり時代遅れです。現在のバージョンはcode.google.comで管理されており、Python3.xをサポートしています。（免責事項：私はgmpyの現在のメンテナーです。）

casevh

score 1 · Accepted Answer

GNU Multi-Precisionパッケージ（gmpy）を調べます。これにより、任意精度の計算を実行できます。おそらく次のように実行できます。

comb(n, k, exact=1)/2**k/2**(n-k)

しかし、gmpyの長整数を使用します。

実際、正確な整数計算を使用すると、組み合わせ部分でn=10000に簡単に到達できます。このためには、以下を使用する必要があります。

comb(n, k, exact=1)

comb(n, k)オーバーフローする浮動小数点近似の代わりに。

ただし、元のポスターに記載されているように、返される（長い）整数は長すぎて浮動小数点数で乗算できない場合があります。

さらに、すぐに別の問題が発生します。0.5**1000=9.3…e-302はすでにフロートアンダーフローに非常に近い…

要約すると、すべてkのに対して正確な結果が本当に必要な場合n~10,000は、元の投稿の数式とは異なるアプローチを使用する必要があります。これには、倍精度浮動小数点演算の制限があります。上記のようにgmpyを使用することは、解決策になる可能性があります（テストされていません！）。

score 0 · Accepted Answer

特にPythonソリューションではありませんが、小さな小数のエラーを処理できる場合は、スターリングのnの近似を使用してみてください。

comb（n、k）= n！/（k！*（nk）！）、ここでn！大きいnの場合、およそsqrt（2 * Pi n）（n / e）^nです。

n> 1000の場合、小数誤差は非常に小さいはずです。

nが大きい確率計算では、中間結果の対数を使用します。

log p = log（comb（n、k））-n * log（2）

p = exp（log（p））

score -1 · Accepted Answer

#  This imports the array function form numpy

from numpy import array

    # the following defines the factorial function to be used in the binomial commands/
# n+1 is used in the range to include the nth term

def factorial (n):
    f=1
    for x in range(1,n+1):
        f=f*(x)
    return f

# The follwong calculates the binomial coefficients for given values of n & k 
def binomial (n,k):
    b=1
    b=(factorial(n)/(factorial(k)*factorial(n-k)))
    return int(b)

# the following lines define the pascal triangle , and print it out for 20 rows./
# in order to include nth term, the n +1 term needs to be in the range. The commands/
# append the next binomial coeficiant to a raw first and then append rows to the triangle/
# and prints a 20 row size pascal triangle
def pascal(T):
    triangle=[]
    for n in range(T):
        r=[]
        for k in range(n+1):
            r.append(binomial(n,k))
        triangle.append(r)
    return triangle

for r in pascal(20):
    print((r))

python - 非常に大きな数のPythonでの二項検定

6 に答える 6

Related

Reference