python - Modified groupby ngroup in Python with cuDF using list comprehension

Question

I am trying to create a function similar to pandas' groupby().ngroup(). The difference is that I want the count to restart from 0 within each subgroup. Given the following data:

| EVENT_1 | EVENT_2 |
| ------- | ------- |
|       0 |       3 | 
|       0 |       3 |
|       0 |       3 |
|       0 |       5 |
|       0 |       5 |
|       0 |       5 |
|       0 |       9 |
|       0 |       9 |
|       1 |       6 |
|       1 |       6 |

I want:

| EVENT_1 | EVENT_2 | EVENT_2A |
| ------- | ------- | -------- |
|       0 |       3 |        0 |
|       0 |       3 |        0 |
|       0 |       3 |        0 |
|       0 |       5 |        1 |
|       0 |       5 |        1 |
|       0 |       5 |        1 |
|       0 |       9 |        2 |
|       0 |       9 |        2 |
|       1 |       6 |        0 |
|       1 |       6 |        0 |

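I think the best way to implement this is to run groupby() on EVENT_1, get the unique values of EVENT_2 within each group, and then set EVENT_2A to the index of each row's value in that list of unique values. For example, in the EVENT_1 == 0 group the unique values are [3, 5, 9], so EVENT_2A is the position of the corresponding EVENT_2 value in that list.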
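To make the target behaviour concrete, here is a minimal CPU-only sketch in plain Python (no cuDF or Numba involved). The function name `restarting_ngroup` is hypothetical and chosen only for illustration. It follows the description above: within each EVENT_1 group, a row's label is the position of its EVENT_2 value among that group's unique values, in order of first appearance.

```python
def restarting_ngroup(event_1, event_2):
    """Label each row with the index of its EVENT_2 value among the
    unique EVENT_2 values seen so far in the same EVENT_1 group."""
    seen = {}    # EVENT_1 key -> {EVENT_2 value -> index within that group}
    labels = []
    for key, value in zip(event_1, event_2):
        group = seen.setdefault(key, {})
        # First time this value appears in the group: assign the next index.
        if value not in group:
            group[value] = len(group)
        labels.append(group[value])
    return labels


event_1 = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
event_2 = [3, 3, 3, 5, 5, 5, 9, 9, 6, 6]
print(restarting_ngroup(event_1, event_2))
# [0, 0, 0, 1, 1, 1, 2, 2, 0, 0]
```

This version does not rely on the data being sorted; it is only a reference for the expected output, not a GPU implementation.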

Here is the cuDF code I wrote. Note that EVENT_2 is always sorted with respect to EVENT_1, so the unique values can be found in O(n).

import cudf
from numba import cuda
import numpy as np

def count(EVENT_2, EVENT_2A):
    # Get unique values of EVENT_2
    uq = [EVENT_2[0]] + [x for i, x in enumerate(EVENT_2) if i > 0 and EVENT_2[i-1] != x]

    for i in range(cuda.threadIdx.x, len(EVENT_2), cuda.blockDim.x):
        # Get corresponding index for each value. This can probably be sped up by mapping 
        # values to indices
        for j, v in enumerate(uq):
            if v == EVENT_2[i]:
                EVENT_2A[i] = j
                break


if __name__ == "__main__":
    data = {
        "EVENT_1":[0,0,0,0,0,0,0,0,1,1],
        "EVENT_2":[3,3,3,5,5,5,9,9,6,6]
    }
    df = cudf.DataFrame(data)
    results = df.groupby(["EVENT_1"], method="cudf").apply_grouped(
        count, 
        incols=["EVENT_2"], 
        outcols={"EVENT_2A":np.int64}
    )
    print(results.sort_index())

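The problem is that there appears to be an error related to the use of a list in the user-defined function count(). According to Numba, its JIT nopython compiler can handle list comprehensions, and indeed, when I use the function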

from numba import jit

@jit(nopython=True)
def uq_sorted(my_list):
    return [my_list[0]] + [x for i, x in enumerate(my_list) if i > 0 and my_list[i-1] != x]

it works, although with a deprecation warning.

The error I get when using cudf is

No implementation of function Function(<numba.cuda.compiler.DeviceFunctionTemplate object at 0x7f782a179fa0>) found for signature:
 
 >>> count <CUDA device function>(array(int64, 1d, C), array(int64, 1d, C))
 
There are 2 candidate implementations:
  - Of which 2 did not match due to:
  Overload in function 'count <CUDA device function>': File: ../../../../test.py: Line 11.
    With argument(s): '(array(int64, 1d, C), array(int64, 1d, C))':
   Rejected as the implementation raised a specific error:
     TypingError: Failed in nopython mode pipeline (step: nopython frontend)
   Unknown attribute 'append' of type list(undefined)<iv=None>
   
   File "test.py", line 12:
   def count(EVENT_2, EVENT_2A):
       uq = [EVENT_2[0]] + [x for i, x in enumerate(EVENT_2) if i > 0 and EVENT_2[i-1] != x]
       ^
   
   During: typing of get attribute at test.py (12)
   
   File "test.py", line 12:
   def count(EVENT_2, EVENT_2A):
       uq = [EVENT_2[0]] + [x for i, x in enumerate(EVENT_2) if i > 0 and EVENT_2[i-1] != x]
       ^

  raised from /project/conda_env/lib/python3.8/site-packages/numba/core/typeinfer.py:1071

During: resolving callee type: Function(<numba.cuda.compiler.DeviceFunctionTemplate object at 0x7f782a179fa0>)
During: typing of call at <string> (10)


File "<string>", line 10:
<source missing, REPL/exec in use?>

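Is this related to the deprecation warning from numba? Even when I set uq to a static list, I still get an error. Any solution to the list comprehension problem, or to my problem as a whole, would be greatly appreciated. Thank you.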
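Numba's CUDA target does not support comprehensions (list or otherwise) inside device code, which is consistent with the `Unknown attribute 'append' of type list(undefined)` error above. The `@jit(nopython=True)` version runs on the CPU target, where list comprehensions are supported. One possible workaround, sketched below under the assumption that EVENT_2 is sorted within each group as stated, is to avoid building `uq` at all: the index of a value among the unique values equals the number of value changes seen up to that row. The helper `count_changes` and its `start`/`stride` parameters are hypothetical stand-ins so the logic can be run on the CPU with plain lists. Inside a function passed to `apply_grouped` they would correspond to `cuda.threadIdx.x` and `cuda.blockDim.x` from the original code. This sketch has not been tested on a GPU.

```python
def count_changes(event_2, event_2a, start=0, stride=1):
    """For each row handled by this (simulated) thread, write the number of
    value changes in event_2 up to that row. Uses only loops, indexing and
    scalars; no lists are created inside the function."""
    for i in range(start, len(event_2), stride):
        changes = 0
        # Count positions j <= i where the value differs from its predecessor.
        for j in range(1, i + 1):
            if event_2[j] != event_2[j - 1]:
                changes += 1
        event_2a[i] = changes


# One EVENT_1 group at a time, as a grouped apply would pass them in.
group_0 = [3, 3, 3, 5, 5, 5, 9, 9]
out_0 = [0] * len(group_0)
# Simulate 4 threads striding over the group.
for thread in range(4):
    count_changes(group_0, out_0, start=thread, stride=4)
print(out_0)  # [0, 0, 0, 1, 1, 1, 2, 2]
```

Each row rescans its own prefix, so the work per group is quadratic in the group size. A single sequential pass per group would be linear, but it would not spread the work across the threads in the block.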
