cuda - Hillis & Steele: kernel function

Question

Hillis & Steele: can anyone help me understand how the kernel function works for each thread's execution?

__global__ void scan(float *g_odata, float *g_idata, int n)
{
    extern __shared__ float temp[]; // allocated on invocation
    int thid = threadIdx.x;
    int pout = 0, pin = 1;
    // load input into shared memory.
    // This is exclusive scan, so shift right by one and set first elt to 0
    temp[pout*n + thid] = (thid > 0) ? g_idata[thid-1] : 0;
    __syncthreads();
    for (int offset = 1; offset < n; offset *= 2)
    {
      pout = 1 - pout; // swap double buffer indices
      pin = 1 - pout;
      if (thid >= offset)
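        // read both operands from the input buffer; += would reuse stale data in the output buffer
        temp[pout*n+thid] = temp[pin*n+thid] + temp[pin*n+thid - offset];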
      else
        temp[pout*n+thid] = temp[pin*n+thid];
      __syncthreads();
    }
    g_odata[thid] = temp[pout*n+thid]; // write output
}

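So far I understand the following: pout=0, pin=1 and thid is in [0, blockDim.x - 1]. So up to the first synchronization the kernel simply shifts the input right by one. For example, if the array is [1 | 2 | 5 | 7], the new array is [0 | 1 | 2 | 5 | 7].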
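
To check the load step without a GPU, here is a minimal sketch in plain C++ (host code, not CUDA) that replays the load line of the kernel for every thid in turn. The name load_shifted is hypothetical, and the sketch assumes one thread per input element (n equal to blockDim.x) and a temp array of 2*n floats; neither is stated in the code above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side helper: replays the kernel's load step for
// thid = 0 .. n-1 and returns the whole double buffer (assumed 2*n floats).
std::vector<float> load_shifted(const std::vector<float>& g_idata) {
    const int n = static_cast<int>(g_idata.size());
    assert(n > 0);
    // Zero-filled here for readability; real shared memory starts uninitialized.
    std::vector<float> temp(2 * n, 0.0f);
    const int pout = 0;                     // initial value in the kernel
    for (int thid = 0; thid < n; ++thid) {  // one pass per CUDA thread
        // Exclusive scan: shift right by one and set the first element to 0.
        temp[pout * n + thid] = (thid > 0) ? g_idata[thid - 1] : 0.0f;
    }
    return temp;
}
```

Under these assumptions, the input {1, 2, 5, 7} fills the first buffer with {0, 1, 2, 5}: four threads write four slots, so the trailing 7 is not carried over, and temp[4] is already the first slot of the second buffer.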

I think of the for loop as running in several instances, one for each thid. For example, for thid=0 the following steps are executed:

thid=0
- offset=1
- pout = 1-0=1 (using the pout initialization at the top of the function)
- pin = 1 - 1 = 0; (using the pout just computed, i.e. 1)
- temp[4] = temp[0] (else statement)
- [0 | 1 | 2 | 5 | 0]
- offset=2
- pout = 1-1=0 (using the pout from the previous loop step)
- pin = 1 - 0 = 1; (the value just computed)
- temp[0] = temp[4] (else statement)
- [0 | 1 | 2 | 5 | 0]
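
The index arithmetic in a trace like this can be double-checked on the host. The following is a rough sketch in plain C++, not part of the kernel: the Step struct and trace_indices are hypothetical names, and n is assumed to be the number of threads in the block. It records, for one thread, which flat index of temp is written and which is read in each loop iteration.

```cpp
#include <cassert>
#include <vector>

// One loop iteration as seen by a single thread (all names hypothetical).
struct Step {
    int offset;  // loop variable
    int pout;    // buffer written in this iteration (0 or 1)
    int pin;     // buffer read in this iteration (0 or 1)
    int dst;     // flat index written: pout*n + thid
    int src;     // flat index read: pin*n + thid
    bool adds;   // true when thid >= offset, i.e. the if branch runs
};

std::vector<Step> trace_indices(int n, int thid) {
    assert(n > 0 && thid >= 0 && thid < n);
    std::vector<Step> steps;
    int pout = 0, pin = 1;  // same initial values as the kernel
    for (int offset = 1; offset < n; offset *= 2) {
        pout = 1 - pout;    // swap double buffer indices
        pin = 1 - pout;
        steps.push_back({offset, pout, pin, pout * n + thid,
                         pin * n + thid, thid >= offset});
    }
    return steps;
}
```

For n = 4, trace_indices(4, 0) yields two steps: a write to index 4 from index 0, then a write to index 0 from index 4. When adds is true the thread additionally reads index src - offset. The largest index any thread can write is n + (n - 1) = 2*n - 1.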
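
Here the pout and pin variables change according to what happens inside the for loop, and their initialization at the top of the function is no longer taken into account. In the same way, I imagine the execution for thid=1:
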
thid=1
- offset=1
- pout = 1 - 0 = 1 (using the pout initialization at the top of the function)
- pin = 1 - 1 = 0
- temp[4+1] = temp[0+1-1] (if statement). Isn't this outside the memory range of temp?
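
Whether temp[4+1] is out of range depends on how much shared memory the launch provides, and the launch code is not shown here. Assuming the kernel is launched with room for 2*n floats (two buffers of n elements each), index 5 with n = 4 is the second slot of the second buffer. The sketch below replays the whole kernel on the host in plain C++ under that assumption. The name scan_on_host is hypothetical; each inner loop over thid stands in for the threads of one block, and the sum reads both of its operands from the pin buffer.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side replay of the double-buffered scan. Finishing an
// inner loop over thid plays the role of __syncthreads().
std::vector<float> scan_on_host(const std::vector<float>& g_idata) {
    const int n = static_cast<int>(g_idata.size());
    assert(n > 0);
    std::vector<float> temp(2 * n, 0.0f);  // assumed size: two buffers of n
    int pout = 0, pin = 1;
    for (int thid = 0; thid < n; ++thid)   // exclusive scan: shift right
        temp[pout * n + thid] = (thid > 0) ? g_idata[thid - 1] : 0.0f;
    for (int offset = 1; offset < n; offset *= 2) {
        pout = 1 - pout;                   // swap double buffer indices
        pin = 1 - pout;
        for (int thid = 0; thid < n; ++thid) {
            const int dst = pout * n + thid;
            assert(dst < 2 * n);           // largest index is 2*n - 1
            if (thid >= offset)
                temp[dst] = temp[pin * n + thid] + temp[pin * n + thid - offset];
            else
                temp[dst] = temp[pin * n + thid];
        }
    }
    // The result lives in whichever buffer was written last.
    return std::vector<float>(temp.begin() + pout * n,
                              temp.begin() + pout * n + n);
}
```

Replaying the threads one after another is safe in this sketch because every iteration only reads from the pin buffer and only writes to the pout buffer. With the input {1, 2, 5, 7} the buffers go from {0, 1, 2, 5} to {0, 1, 3, 7} after offset=1 and to {0, 1, 3, 8} after offset=2.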