python - Pandas rolling computations with a window based on values instead of counts

Question

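I'm looking for a way to do something like the various rolling_* functions of pandas, but I want the window of the rolling computation to be defined by a range of values (say, a range of values of a column of the DataFrame), not by the number of rows in the window.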

As an example, suppose I have this data:

>>> print(d)
   RollBasis  ToRoll
0          1       1
1          1       4
2          1      -5
3          2       2
4          3      -4
5          5      -2
6          8       0
7         10     -13
8         12      -2
9         13      -5

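If I do something like rolling_sum(d, 5), I get a rolling sum in which each window contains 5 rows. What I want instead is a rolling sum in which each window contains a certain range of values of RollBasis. That is, I'd like to be able to do something like d.roll_by(sum, 'RollBasis', 5) and get a result where the first window contains all rows whose RollBasis is between 1 and 5, the second window contains all rows whose RollBasis is between 2 and 6, the third window contains all rows whose RollBasis is between 3 and 7, and so on. The windows will not have equal numbers of rows, but the range of RollBasis values selected in each window will be the same. So the output should look like this: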

>>> d.roll_by(sum, 'RollBasis', 5)
    1    -4    # sum of elements with 1 <= RollBasis <= 5
    2    -4    # sum of elements with 2 <= RollBasis <= 6
    3    -6    # sum of elements with 3 <= RollBasis <= 7
    4    -2    # sum of elements with 4 <= RollBasis <= 8
    # etc.

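I can't do this with groupby, because groupby always produces disjoint groups. I can't do it with the rolling functions, because their windows always roll by number of rows, not by values. So how can I do it?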

score 19 · Accepted Answer

I think this does what you want:

In [1]: df
Out[1]:
   RollBasis  ToRoll
0          1       1
1          1       4
2          1      -5
3          2       2
4          3      -4
5          5      -2
6          8       0
7         10     -13
8         12      -2
9         13      -5

In [2]: def f(x):
   ...:     ser = df.ToRoll[(df.RollBasis >= x) & (df.RollBasis < x+5)]
   ...:     return ser.sum()

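The function above takes a value (here, a RollBasis value) and uses it to index the ToRoll column of the DataFrame. The resulting Series consists of the ToRoll values whose RollBasis lies between that value (inclusive) and that value + 5 (exclusive). Finally, that Series is summed and the total is returned.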

In [3]: df['Rolled'] = df.RollBasis.apply(f)

In [4]: df
Out[4]:
   RollBasis  ToRoll  Rolled
0          1       1      -4
1          1       4      -4
2          1      -5      -4
3          2       2      -4
4          3      -4      -6
5          5      -2      -2
6          8       0     -15
7         10     -13     -20
8         12      -2      -7
9         13      -5      -5

Code for the toy example DataFrame, in case someone else wants to try it:

In [1]: from pandas import *

In [2]: import io

In [3]: text = """\
   ...:    RollBasis  ToRoll
   ...: 0          1       1
   ...: 1          1       4
   ...: 2          1      -5
   ...: 3          2       2
   ...: 4          3      -4
   ...: 5          5      -2
   ...: 6          8       0
   ...: 7         10     -13
   ...: 8         12      -2
   ...: 9         13      -5
   ...: """

In [4]: df = read_csv(io.StringIO(text), header=0, index_col=0, sep=r'\s+')

score 16 · Accepted Answer

Based on Zelazny7's answer, I created this more general solution:

def rollBy(what, basis, window, func):
    def applyToWindow(val):
        chunk = what[(val<=basis) & (basis<val+window)]
        return func(chunk)
    return basis.apply(applyToWindow)

>>> rollBy(d.ToRoll, d.RollBasis, 5, sum)
0    -4
1    -4
2    -4
3    -4
4    -6
5    -2
6   -15
7   -20
8    -7
9    -5
Name: RollBasis

This is still not ideal, because it is very slow compared to rolling_apply, but that may be unavoidable.
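For the specific case of a sum, the per-row Python call can probably be avoided altogether. The following is a minimal sketch, not something taken from the answers here: it assumes `basis` is sorted ascending, uses `np.searchsorted` to find each window's bounds, and uses a prefix sum to get each window total. The name `roll_sum_by_value` is made up for illustration.

```python
import numpy as np
import pandas as pd


def roll_sum_by_value(what, basis, window):
    """Sum `what` over rows with b <= basis < b + window, for each b in basis.

    Hypothetical helper; assumes `basis` is sorted in ascending order.
    """
    b = basis.to_numpy()
    w = what.to_numpy()
    # Prefix sums with a leading zero, so sum(w[i:j]) == csum[j] - csum[i].
    csum = np.concatenate(([0], np.cumsum(w)))
    # Position of the first row inside each window ...
    start = np.searchsorted(b, b, side="left")
    # ... and of the first row past its (exclusive) upper limit.
    stop = np.searchsorted(b, b + window, side="left")
    return pd.Series(csum[stop] - csum[start], index=basis.index)


d = pd.DataFrame({
    "RollBasis": [1, 1, 1, 2, 3, 5, 8, 10, 12, 13],
    "ToRoll": [1, 4, -5, 2, -4, -2, 0, -13, -2, -5],
})
result = roll_sum_by_value(d["ToRoll"], d["RollBasis"], 5)
```

On the toy data this gives the same values as `rollBy(d.ToRoll, d.RollBasis, 5, sum)`. The trick only works for reductions that can be expressed through prefix sums (sum, count, mean), and with floating-point data the differences of cumulative sums can accumulate rounding error.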

score 13 · Accepted Answer

Based on BrenBarn's answer, but sped up by using label-based indexing rather than boolean-based indexing:

import pandas as pd

def rollBy(what, basis, window, func, *args, **kwargs):
    # Note that basis must be sorted in order for this to work properly.
    indexed_what = pd.Series(what.values, index=basis.values)
    def applyToWindow(val):
        # Using slice_indexer rather than what.loc[val:val+window] allows
        # window limits that are not specifically in the index.
        # Both limits are inclusive, so rows at exactly val+window are included.
        indexer = indexed_what.index.slice_indexer(val, val+window, 1)
        # The indexer is a positional slice, so apply it with .iloc.
        chunk = indexed_what.iloc[indexer]
        return func(chunk, *args, **kwargs)
    rolled = basis.apply(applyToWindow)
    return rolled

This is much faster than not using the indexed column:

In [46]: df = pd.DataFrame({"RollBasis":np.random.uniform(0,1000000,100000), "ToRoll": np.random.uniform(0,10,100000)})

In [47]: df = df.sort_values("RollBasis")

In [48]: timeit("rollBy_Ian(df.ToRoll,df.RollBasis,10,sum)",setup="from __main__ import rollBy_Ian,df", number =3)
Out[48]: 67.6615059375763

In [49]: timeit("rollBy_Bren(df.ToRoll,df.RollBasis,10,sum)",setup="from __main__ import rollBy_Bren,df", number =3)
Out[49]: 515.0221037864685

Note that, in the average case, the index-based solution is O(n) while the boolean-slicing version is O(n^2) (I think).
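One caveat when comparing the two versions, as far as I can tell: `slice_indexer` treats both limits as inclusive, while the boolean version uses `basis < val+window`, so the two can disagree whenever a row sits exactly at `val+window`. The sketch below shows this on the toy data and gives a half-open variant built on `np.searchsorted`. The name `roll_by_half_open` is made up for illustration, and the sketch assumes `basis` is sorted ascending.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame({
    "RollBasis": [1, 1, 1, 2, 3, 5, 8, 10, 12, 13],
    "ToRoll": [1, 4, -5, 2, -4, -2, 0, -13, -2, -5],
})

# Label-based slicing includes both end points: the window starting at 5
# with width 5 also picks up the row where RollBasis == 10.
inclusive = pd.Index(d["RollBasis"]).slice_indexer(5, 10)
inclusive_sum = d["ToRoll"].iloc[inclusive].sum()  # rows with 5 <= RollBasis <= 10


def roll_by_half_open(what, basis, window, func):
    """Apply func to rows with val <= basis < val + window (hypothetical helper)."""
    values = basis.to_numpy()
    # Binary search for the first row >= val and the first row >= val + window.
    starts = np.searchsorted(values, values, side="left")
    stops = np.searchsorted(values, values + window, side="left")
    results = [func(what.iloc[i:j]) for i, j in zip(starts, stops)]
    return pd.Series(results, index=basis.index)


half_open = roll_by_half_open(d["ToRoll"], d["RollBasis"], 5, sum)
```

On the toy data, `inclusive_sum` is -15, whereas the boolean version gives -2 for the window starting at 5. `half_open` reproduces the Rolled column from the first answer.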

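I find it more useful to do this over evenly spaced windows running from the minimum of basis to its maximum, rather than at every value of basis. That means changing the function as follows: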

import numpy as np
import pandas as pd

def rollBy(what, basis, window, func, *args, **kwargs):
    # Note that basis must be sorted in order for this to work properly.
    windows_min = basis.min()
    windows_max = basis.max()
    # Window start points, spaced one window width apart.
    window_starts = np.arange(windows_min, windows_max, window)
    window_starts = pd.Series(window_starts, index=window_starts)
    indexed_what = pd.Series(what.values, index=basis.values)
    def applyToWindow(val):
        # Using slice_indexer rather than what.loc[val:val+window] allows
        # window limits that are not specifically in the index.
        # Both limits are inclusive, so rows at exactly val+window are included.
        indexer = indexed_what.index.slice_indexer(val, val+window, 1)
        # The indexer is a positional slice, so apply it with .iloc.
        chunk = indexed_what.iloc[indexer]
        return func(chunk, *args, **kwargs)
    rolled = window_starts.apply(applyToWindow)
    return rolled
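Since these window starts are spaced exactly one window width apart, the windows barely overlap, and something close to this can probably be done with a plain groupby on a bin number. This is only a sketch of that idea, not part of the function above: it uses half-open bins [start, start + window), so a row lying exactly on a boundary belongs to one bin only, and empty bins are simply missing from the result.

```python
import pandas as pd

d = pd.DataFrame({
    "RollBasis": [1, 1, 1, 2, 3, 5, 8, 10, 12, 13],
    "ToRoll": [1, 4, -5, 2, -4, -2, 0, -13, -2, -5],
})
window = 5
lowest = d["RollBasis"].min()

# Number each row by the window it falls into: 0 for [1, 6), 1 for [6, 11), ...
bin_id = (d["RollBasis"] - lowest) // window
binned = d["ToRoll"].groupby(bin_id).sum()

# Relabel the result by the start of each window instead of the bin number.
binned.index = lowest + binned.index * window
```

For the toy data the window starts are 1, 6 and 11, and the sums are -4, -13 and -7.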
