performance - How to compute a sum in CUDA without using atomics

Question

In the code below, how can I compute the sum_array values without using atomicAdd?

Kernel

__global__ void calculate_sum( int width,
                               int height,
                               int *pntrs,
                               int2 *sum_array )
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if ( row >= height || col >= width ) return;

    int idx = pntrs[ row * width + col ];

    //atomicAdd( &sum_array[ idx ].x, col );

    //atomicAdd( &sum_array[ idx ].y, row );

    sum_array[ idx ].x += col;

    sum_array[ idx ].y += row;
}

Kernel launch

    dim3 dimBlock( 16, 16 );
    dim3 dimGrid( ( width + ( dimBlock.x - 1 ) ) / dimBlock.x, 
                  ( height + ( dimBlock.y - 1 ) ) / dimBlock.y );

score 1 · Accepted Answer

Reduction is the general name for this kind of problem. See the presentation for a detailed explanation, or search Google for other examples.

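The usual way to solve this is to compute a parallel sum over a segment of global memory within each thread block and store the result in global memory. You then copy the partial results to CPU memory, sum them on the CPU, and copy the result back to GPU memory. You can avoid the memory copies by running another parallel sum over the partial results.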
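A minimal sketch of the two-pass variant described above, for a plain array of ints. The names block_sum and device_sum, the block size of 256 and the grid size of 64 are assumptions chosen for illustration, the block size is assumed to be a power of two, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each block sums part of `in` and writes one partial
// result to out[blockIdx.x]. Assumes blockDim.x is a power of two.
__global__ void block_sum(const int *in, int n, int *out)
{
    extern __shared__ int partial[];

    // Grid-stride loop: each thread privately accumulates several elements.
    int local = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        local += in[i];

    partial[threadIdx.x] = local;
    __syncthreads();

    // Reduce the shared array, halving the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = partial[0];
}

// Sums the input on the device with two kernel launches and no atomics.
int device_sum(const std::vector<int> &host)
{
    const int threads = 256, blocks = 64;  // assumed launch configuration
    const int n = (int)host.size();
    int *d_in, *d_partial, *d_total;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_partial, blocks * sizeof(int));
    cudaMalloc(&d_total, sizeof(int));
    cudaMemcpy(d_in, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // Pass 1: one partial sum per block, stored in global memory.
    block_sum<<<blocks, threads, threads * sizeof(int)>>>(d_in, n, d_partial);
    // Pass 2: a single block reduces the partial sums on the GPU.
    block_sum<<<1, threads, threads * sizeof(int)>>>(d_partial, blocks, d_total);

    int total = 0;
    cudaMemcpy(&total, d_total, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_total);
    return total;
}

int main()
{
    std::vector<int> data(100000, 1);
    printf("sum = %d\n", device_sum(data));
    return 0;
}
```

Reusing the same kernel for the second pass keeps the partial results on the device, so only the final int is copied back to the host.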

Another approach is to use a highly optimized library for CUDA, such as Thrust or CUDPP.
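As a rough illustration of the library route, the per-label sums from the question could be sketched with Thrust by sorting pixel indices by label and then calling thrust::reduce_by_key. The helper label_sums, the functors col_of and row_of, and the tiny label image in main are hypothetical, and this avoids atomics only in user code; it makes no claim about what Thrust does internally.

```cuda
#include <cstdio>
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>
#include <thrust/copy.h>

// Map a linear pixel index to its column / row (hypothetical helpers).
struct col_of {
    int width;
    __host__ __device__ int operator()(int i) const { return i % width; }
};
struct row_of {
    int width;
    __host__ __device__ int operator()(int i) const { return i / width; }
};

// For each distinct label (in ascending order) returns the sum of the column
// and row coordinates of the pixels carrying that label.
void label_sums(const std::vector<int> &labels, int width,
                std::vector<int> &found, std::vector<int> &sum_x,
                std::vector<int> &sum_y)
{
    const size_t n = labels.size();
    thrust::device_vector<int> keys(labels.begin(), labels.end());
    thrust::device_vector<int> pixel(n);
    thrust::sequence(pixel.begin(), pixel.end());  // 0, 1, 2, ...

    // Group equal labels together, carrying the pixel indices along.
    thrust::sort_by_key(keys.begin(), keys.end(), pixel.begin());

    thrust::device_vector<int> cols(n), rows(n);
    thrust::transform(pixel.begin(), pixel.end(), cols.begin(), col_of{width});
    thrust::transform(pixel.begin(), pixel.end(), rows.begin(), row_of{width});

    // One output element per run of equal keys.
    thrust::device_vector<int> out_keys(n), sx(n), sy(n);
    size_t count = thrust::reduce_by_key(keys.begin(), keys.end(), cols.begin(),
                                         out_keys.begin(), sx.begin()).first
                   - out_keys.begin();
    thrust::reduce_by_key(keys.begin(), keys.end(), rows.begin(),
                          out_keys.begin(), sy.begin());

    found.resize(count);
    sum_x.resize(count);
    sum_y.resize(count);
    thrust::copy(out_keys.begin(), out_keys.begin() + count, found.begin());
    thrust::copy(sx.begin(), sx.begin() + count, sum_x.begin());
    thrust::copy(sy.begin(), sy.begin() + count, sum_y.begin());
}

int main()
{
    // Hypothetical 4x2 label image standing in for `pntrs`.
    std::vector<int> labels = {0, 1, 1, 0,
                               2, 1, 0, 2};
    std::vector<int> found, sum_x, sum_y;
    label_sums(labels, 4, found, sum_x, sum_y);
    for (size_t k = 0; k < found.size(); ++k)
        printf("label %d: x = %d, y = %d\n", found[k], sum_x[k], sum_y[k]);
    return 0;
}
```

The results are indexed by position among the labels that actually occur, so if the labels are not a dense range starting at 0, use found[k] to scatter the sums back into sum_array.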

score 0 · Accepted Answer

My CUDA is very rusty, but this is roughly how you do it (courtesy of "CUDA by Example", which I strongly recommend reading):

https://developer.nvidia.com/content/cuda-example-introduction-general-purpose-gpu-programming-0

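Do a better partitioning of the array you need to sum. CUDA threads are lightweight, but not so lightweight that you can spawn one for every two additions and expect a performance benefit in return.
At this point, each thread is tasked with summing a slice of the data. Create a shared array of ints as large as the number of threads, where each thread saves the partial sum it computed.
Synchronize the threads and reduce the shared memory array.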

(Please take this as pseudocode)

// Code to sum over a slice, essentially a loop over each thread subset
// and accumulate over "localsum" (a local variable)
...

// Save the result in shared memory
partial[threadIdx.x] = localsum;

// Synchronize the threads:
__syncthreads();

// From now on partial holds the results of all threads, so you can reduce it.
// We'll do it the naive way, using a single thread (it can easily be parallelized)
if (threadIdx.x == 0) {
    for (int i = 1; i < nthreads; ++i) {
        partial[0] += partial[i];
    }
}

And there you go: partial[0] holds your sum (or whatever you computed).
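For reference, the pseudocode above might be filled out as follows for a single thread block. This is only a sketch under assumptions: the kernel name slice_sum, the host wrapper run_slice_sum, the block size NTHREADS and the strided way each thread picks its slice are not from the answer, and error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NTHREADS 128  // assumed block size; the kernel expects exactly this

// Single-block sum: per-thread slice, shared array, serial final reduction.
__global__ void slice_sum(const int *data, int n, int *result)
{
    __shared__ int partial[NTHREADS];

    // Each thread sums its slice: elements tid, tid + NTHREADS, ...
    int localsum = 0;
    for (int i = threadIdx.x; i < n; i += NTHREADS)
        localsum += data[i];

    // Save the result in shared memory and wait for every thread.
    partial[threadIdx.x] = localsum;
    __syncthreads();

    // Naive final step: thread 0 adds up all partial sums.
    if (threadIdx.x == 0) {
        for (int i = 1; i < NTHREADS; ++i)
            partial[0] += partial[i];
        *result = partial[0];
    }
}

// Hypothetical host wrapper: copies the data over, launches one block,
// and returns the sum.
int run_slice_sum(const int *host, int n)
{
    int *d_data, *d_result;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_result, sizeof(int));
    cudaMemcpy(d_data, host, n * sizeof(int), cudaMemcpyHostToDevice);

    slice_sum<<<1, NTHREADS>>>(d_data, n, d_result);

    int sum = 0;
    cudaMemcpy(&sum, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_result);
    return sum;
}

int main()
{
    const int n = 1000;
    int h[n];
    for (int i = 0; i < n; ++i)
        h[i] = i;  // 0 + 1 + ... + 999
    printf("sum = %d\n", run_slice_sum(h, n));
    return 0;
}
```

With several blocks, each block would write its own partial[0] to global memory, and those per-block results would still need to be combined in a further step.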

For a more rigorous discussion of the topic, and a reduction algorithm that runs in roughly O(log(n)), see the dot product example in "CUDA by Example".

Hope this helps
