cuda - 優先キューを維持するための cuda カーネル内のストリーム圧縮

Question

cuda プログラムの最適化戦略を探しています。カーネルの for ループ内の反復ごとに、各スレッドがスコアを生成します。ブロックごとにスコアの上位 k を維持するために、スコアの共有優先度キューを維持しています。以下の疑似コードを参照してください。

__global__ gpuCompute(... arguments)
{
    __shared__ myPriorityQueue[k];  //To maintain top k scores ( k < #threads in this block)
    __shared__ scoresToInsertInQueue[#threadsInBlock];
    __shared__ counter;
    for(...)       //About 1-100 million iterations
    {
        int score = calculate_score(...);
        if(score > Minimum element of P. Queue && ...)
        {
            ATOMIC Operation : localCounter = counter++;
            scoresToInsertInQueue[localCounter] = score;
        }
        __syncthreads();
        //Merge scores from scoresToInsertInQueue to myPriorityQueue
        while(counter>0)
        {
            //Parallel insertion of scoresToInsertInQueue[counter] to myPriorityQueue using the participation of k threads in this block
            counter--;  
            __syncthreads(); 
        }
        __syncthreads();
    }
}

上記のコードが皆さんにとって意味のあるものであることを願っています。今、アトミック操作のオーバーヘッドを削除する方法を探しています。各スレッドは、値が優先キューに送られるかどうかに応じて「1」または「0」を保存します。'1000000000100000000' を '11000000000000000000' バッファーに減らして (または '1' のインデックスを知って)、最終的に '1' に対応するスコアをキューに挿入できるように、カーネル内にストリーム圧縮の実装があるかどうか疑問に思っています。
この場合、「1」は非常にまばらになることに注意してください。

score 2 · Accepted Answer

それらが非常にまばらである場合、atomicメソッドは最速である可能性があります。ただし、ここで説明する方法は、最悪の場合のパフォーマンスをより予測可能で限定的なものにします。

決定配列に 1 と 0 を適切に混在させるには、並列スキャンまたはプレフィックスサムを使用して、決定配列から挿入ポイントインデックス配列を構築する方が高速な場合があります。

スコア > 30 を選択してキューに入れるしきい値処理の決定があるとします。私のデータは次のようになります。

scores:     30  32  28  77  55  12  19
score > 30:  0   1   0   1   1   0   0
insert_pt:   0   0   1   1   2   3   3    (an "exclusive prefix sum")

次に、各スレッドは次のようにストレージを選択します。

if (score[threadIdx.x] > 30) temp[threadIdx.x] = 1;
else temp[threadIdx.x] = 0;
__syncthreads();
// perform exclusive scan on temp array into insert_pt array
__syncthreads();
if (temp[threadIdx.x] == 1)
  myPriorityQueue[insert_pt[threadIdx.x]] = score[threadIdx.x];

CUBには高速な並列プレフィックススキャンがあります。

cuda - 優先キューを維持するための cuda カーネル内のストリーム圧縮

1 に答える 1

Related

Reference