python - 整数係数のリストとその倍長整数表現の間の変換を最適化する

Question

私の多項式の実装を最適化しようとしています。n特に、係数を法とする多項式( かもしれません>2^64) と形式x^r - 1(rである) の多項式を法とする多項式を扱ってい< 2^64ます。現時点では、係数を整数のリスト (*) として表し、すべての基本的な操作を最も簡単な方法で実装しました。

累乗と乗算をできるだけ高速にしたいと考えており、これを取得するために、すでにさまざまなアプローチを試しました。私の現在のアプローチは、係数のリストを巨大な整数に変換し、整数を乗算して係数をアンパックすることです。

問題は、梱包と開梱に時間がかかることです。

では、「パック/アンパック」機能を改善する方法はありますか?

def _coefs_to_long(coefs, window):
    '''Given a sequence of coefficients *coefs* and the *window* size return a
    long-integer representation of these coefficients.
    '''

    res = 0
    adder = 0
    for k in coefs:
        res += k << adder
        adder += window
    return res
    #for k in reversed(coefs): res = (res << window) + k is slower


def _long_to_coefs(long_repr, window, n):
    '''Given a long-integer representing coefficients of size *window*, return
    the list of coefficients modulo *n*.
    '''

    mask = 2**window - 1
    coefs = [0] * (long_repr.bit_length() // window + 1)
    for i in xrange(len(coefs)):
        coefs[i] = (long_repr & mask) % n
        long_repr >>= window

    # assure that the returned list is never empty, and hasn't got an extra 0.
    if not coefs:
        coefs.append(0)
    elif not coefs[-1] and len(coefs) > 1:
        coefs.pop()

    return coefs

を選択していないことに注意してくださいn。これはユーザーからの入力であり、私のプログラムは (AKS テストを使用して) その素数性を証明したいので、因数分解できません。

(*) 私はいくつかのアプローチを試しました:

numpyリストの代わりに配列を使用し、を使用して乗算しnumpy.convolveます。高速ですn < 2^64が、非常に遅いn > 2^64[また、外部ライブラリの使用を避けたい]
を使用してscipy.fftconvolveいます。にはまったく機能しませんn > 2^64。
最初から係数を整数として表します (毎回変換せずに)。問題は、整数を係数のリストに変換せずに操作を行う簡単な方法がわからないことですmod x^r -1(これは、この表現を使用する理由を無効にします)。

score 2 · Accepted Answer

あなたが学ぶためにこれをしているのでない限り、なぜ車輪を再発明するのですか？別のアプローチは、Pythonラッパーを他の多項式ライブラリまたはプログラムに書き込むことです（そのようなラッパーがまだ存在しない場合）。

PARI/GPをお試しください。驚くほど速いです。最近、カスタムCコードを作成しましたが、作成に2日かかり、2行のPARI/GPスクリプトよりもわずか3倍高速であることがわかりました。PARIを呼び出すPythonコードは、Pythonだけで実装するものよりも高速になると思います。PythonからPARIを呼び出すためのモジュールもあります： https ：//code.google.com/p/pari-python/

score 2 · Accepted Answer

任意精度の整数多項式を numpy 配列のリストとして直接実装するのはどうですか?

説明させてください: あなたの多項式が Σ _p A _p X ^pであるとしましょう。大整数 A _{pを A}_p = Σ _k A _p,k 2 ^{64 k}として表すことができる場合、k^番目の numpy 配列には、位置 pに 64 ビットの int A _{p,kが含まれます。}

問題の構造に応じて、密配列または疎配列を選択できます。

加算演算とスカラー演算の実装は、同じ演算の bignum 実装をベクトル化するだけの問題です。

乗算は次のように処理できます: AB = Σ _p,k,p',k' A _p,k B _p',k' 2 ^64(k+k') X ^p+p' . したがって、密な配列を使用した単純な実装では、 orへの₆₄ (n) ^{2 回}の呼び出しが発生する可能性があります。numpy.convolescipy.fftconvolve

モジュロ演算は、左側の項の線形関数であり、右側の項の係数が小さいため、実装が容易です。

編集ここにいくつかの説明があります

多項式を任意精度の数値のリスト (それ自体が 64 ビットの「数字」のリストとして表される) として表す代わりに、次のように表現を転置します。

あなたの多項式は配列のリストとして表されます
k^番目の配列には、各係数の k^番目の「桁」が含まれます

少数の係数のみが非常に大きい場合、配列にはほとんど 0 が含まれるため、スパース配列を使用する価値があります。

^p番目の係数の k^番目の桁をA _{p,kと呼びます。}

大きな整数表現との類推に注意してください。ここで、大きな整数は次のように表されます。

x = Σ _k x _k 2 ^{64 k}

多項式 A は次のように表されます

A = Σ _k A _k 2 ^{64 k} A _k = Σ _k A _p,k X ^p

加算を実装するには、配列のリストが単純な数字のリストであるふりをして、大きな整数に対して通常どおり加算を実装します (if then条件をで置き換えることに注意してnumpy.whereください)。

_{乗算を実装するには、log 64} (n) ²の多項式乗算を行う必要があることがわかります。

係数にモジュロ演算を実装することは、モジュロ演算を大きな整数に変換する単純なケースです。

小さな係数の多項式でモジュロを取るには、この演算の線形性を使用します。

A mod (X ^r - 1) = (Σ _k A _k 2 ^{64 k} ) mod (X ^r - 1)

= Σ _k 2 ^{64 k} (A _k mod (X ^r - 1))

score 2 · Accepted Answer

You could try using residual number systems to represent the coefficients of your polynomial. You would also split up your coefficients into smaller integers as you do now, but you don't need to convert them back to a huge integer to do multiplications or other operations. This should not require much reprogramming effort.

The basic principle of residual number systems is the unique representation of numbers using modular arithmetic. The whole theory surrounding RNS allows you to do your operations on the small coefficients.

edit: a quick example:

Suppose you represent your large coefficients in an RNS with moduli 11 and 13. Your coefficients would all consist of 2 small integers (<11 and <13) that can be combined to the original (large) integer.

Suppose your polynomial is originally 33x²+18x+44. In RNS, the coefficients would respectively be (33 mod 11, 33 mod 13),(18 mod 11,18 mod 13) and (44 mod 11, 44 mod 13)=>(0,7),(7,5) and (0,5).

Multiplying your polynomial with a constant can then be done by multiplying each small coefficient with that constant and do modulo on it.

Say you multiply by 3, your coefficients will become (0,21 mod 13)=(0,8), (21 mod 11,15 mod 13)=(10,2) and (0 mod 11,15 mod 13)=(0,2). There has been no need to convert the coefficients back to their large integer.

To check if our multiplication has worked, we can convert the new coefficients back to their large representation. This requires 'solving' each set of coefficients as a modular system. For the first coefficients (0,8) we would need to solve x mod 11=0 and x mod 13 = 8. This should not be too hard to implement. In this example you can see that x=99 is a valid solution (modulo 13*11)

We then get 99x²+54x+132, the correct multiplied polynomial. Multiplying with other polynomials is similar (but require you to multiply the coefficients with each other in a pairwise manner). The same goes for addition.

For your use case, you could choose your n based on the number of coefficients you want or the their size.

score 1 · Accepted Answer

誰かが私をさらに改善するのを手伝ってくれることを願っていますが、コンバージョンを最適化する方法を見つけました。うまくいけば、他の賢いアイデアを見つけることができます。

基本的に、これらの関数の問題点は、整数をパックするとき、またはアンパックするときに、ある種の2次メモリ割り当ての動作をすることです。（この種の動作の他の例については、Guido van Rossumのこの投稿を参照してください）。

これに気付いた後、分割統治の原則を試してみることにしました。いくつかの結果が得られました。配列を2つの部分に分割し、それらを別々に変換して、最終的に結果を結合します（後でf5、Rossumの投稿[編集：それほど速くはないようです]にあるような反復バージョンを使用しようとします）。

変更された機能：

def _coefs_to_long(coefs, window):
    """Given a sequence of coefficients *coefs* and the *window* size return a
    long-integer representation of these coefficients.
    """

    length = len(coefs)
    if length < 100:
        res = 0
        adder = 0
        for k in coefs:
            res += k << adder
            adder += window
        return res
    else:
        half_index = length // 2
        big_window = window * half_index
        low = _coefs_to_long(coefs[:half_index], window)
        high = _coefs_to_long(coefs[half_index:], window)
        return low + (high << big_window)


def _long_to_coefs(long_repr, window, n):
    """Given a long-integer representing coefficients of size *window*, return
    the list of coefficients modulo *n*.
    """

    win_length = long_repr.bit_length() // window
    if win_length < 256:
        mask = 2**window - 1
        coefs = [0] * (long_repr.bit_length() // window + 1)
        for i in xrange(len(coefs)):
            coefs[i] = (long_repr & mask) % n
            long_repr >>= window

        # assure that the returned list is never empty, and hasn't got an extra 0.
        if not coefs:
            coefs.append(0)
        elif not coefs[-1] and len(coefs) > 1:
            coefs.pop()

        return coefs
    else:
        half_len = win_length // 2
        low = long_repr & (((2**window) ** half_len) - 1)
        high = long_repr >> (window * half_len)
        return _long_to_coefs(low, window, n) + _long_to_coefs(high, window, n)

そして結果：

>>> import timeit
>>> def coefs_to_long2(coefs, window):
...     if len(coefs) < 100:
...         return coefs_to_long(coefs, window)
...     else:
...         half_index = len(coefs) // 2
...         big_window = window * half_index
...         least = coefs_to_long2(coefs[:half_index], window) 
...         up = coefs_to_long2(coefs[half_index:], window)
...         return least + (up << big_window)
... 
>>> coefs = [1, 2, 3, 1024, 256] * 567
>>> # original function
>>> timeit.timeit('coefs_to_long(coefs, 11)', 'from __main__ import coefs_to_long, coefs',
...               number=1000)/1000
0.003283214092254639
>>> timeit.timeit('coefs_to_long2(coefs, 11)', 'from __main__ import coefs_to_long2, coefs',
...               number=1000)/1000
0.0007998988628387451
>>> 0.003283214092254639 / _
4.104536516782767
>>> coefs = [2**64, 2**31, 10, 107] * 567
>>> timeit.timeit('coefs_to_long(coefs, 66)', 'from __main__ import coefs_to_long, coefs',...               number=1000)/1000

0.009775240898132325
>>> 
>>> timeit.timeit('coefs_to_long2(coefs, 66)', 'from __main__ import coefs_to_long2, coefs',
...               number=1000)/1000
0.0012255229949951173
>>> 
>>> 0.009775240898132325 / _
7.97638309362875

ご覧のとおり、このバージョンでは、変換までの速度がかなり4速く8なります（入力が大きいほど、速度が速くなります）。2番目の関数でも同様の結果が得られます。

>>> import timeit
>>> def long_to_coefs2(long_repr, window, n):
...     win_length = long_repr.bit_length() // window
...     if win_length < 256:
...         return long_to_coefs(long_repr, window, n)
...     else:
...         half_len = win_length // 2
...         least = long_repr & (((2**window) ** half_len) - 1)
...         up = long_repr >> (window * half_len)
...         return long_to_coefs2(least, window, n) + long_to_coefs2(up, window, n)
... 
>>> long_repr = coefs_to_long([1,2,3,1024,512, 0, 3] * 456, 13)
>>> # original function
>>> timeit.timeit('long_to_coefs(long_repr, 13, 1025)', 'from __main__ import long_to_coefs, long_repr', number=1000)/1000
0.005114212036132813
>>> timeit.timeit('long_to_coefs2(long_repr, 13, 1025)', 'from __main__ import long_to_coefs2, long_repr', number=1000)/1000
0.001701267957687378
>>> 0.005114212036132813 / _
3.006117885794327
>>> long_repr = coefs_to_long([1,2**33,3**17,1024,512, 0, 3] * 456, 40)
>>> timeit.timeit('long_to_coefs(long_repr, 13, 1025)', 'from __main__ import long_to_coefs, long_repr', number=1000)/1000
0.04037192392349243
>>> timeit.timeit('long_to_coefs2(long_repr, 13, 1025)', 'from __main__ import long_to_coefs2, long_repr', number=1000)/1000
0.005722791910171509
>>> 0.04037192392349243 / _
7.0545853417694

最初の関数で開始インデックスと終了インデックスを渡し、スライスを回避することで、より多くのメモリの再割り当てを回避しようとしましたが、これにより、入力が小さい場合は関数の速度が大幅に低下し、実際の場合は少し遅くなります。入力。あまり良い結果が得られないとは思いますが、それらを混ぜてみることができるかもしれません。

私は前の期間に私の質問を編集したので、何人かの人々は私が最近必要としたものとは異なる目的で私にいくつかのアドバイスをくれました。コメントと回答でさまざまな情報源から指摘された結果を少し明確にして、高速多項式やAKSテストの実装を検討している他の人々に役立つようにすることが重要だと思います。

JF Sebastianが指摘したように、AKSアルゴリズムには多くの改善が加えられているため、古いバージョンのアルゴリズムを実装しようとすると、プログラムが非常に遅くなります。これは、AKSの適切な実装がすでにある場合は、多項式の改善を高速化できるという事実を排除するものではありません。
小さい係数n（ワードサイズの数値を読み取る）に関心があり、外部の依存関係を気にしない場合は、乗算numpyを使用するnumpy.convolveかscipy.fftconvolve、乗算を使用します。それはあなたが書くことができる何よりもはるかに速くなります。残念ながら、nワードサイズでない場合はまったく使用できずscipy.fftconvolve、numpy.convolve地獄のように遅くなります。
モジュロ演算（係数と多項式）を実行する必要がない場合は、素晴らしい結果を約束していなくても（haroldが指摘しているように）、おそらくZBDDを使用することをお勧めします[本当におもしろいので、湊の論文を読んでください]。
係数に対してモジュロ演算を実行する必要がない場合は、Originで述べられているように、おそらくRNS表現を使用することをお勧めします。numpy次に、複数のアレイを組み合わせて効率的に動作させることができます。
係数モジュロが大きい多項式の純粋なPython実装が必要な場合はn、私のソリューションが最速のようです。私はPythonで係数の配列間にfft乗算を実装しようとしませんでしたが（これはより速いかもしれません）。

python - 整数係数のリストとその倍長整数表現の間の変換を最適化する

4 に答える 4

Related

Reference