cuda - CUDA: How do I run the parallel sum reduction code described in the NVIDIA paper by Mark Harris?

Question

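Although I understand the logic behind the parallel reduction described in this paper, I cannot get it to run for a simple example in which the input array holds size elements, all equal to 1.

Here is what I have achieved so far. Note that I am using the Thrust library to manage the input and output data.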

#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <ctime>
#include <sys/time.h>
#include <sstream>
#include <string>
#include <fstream>

using namespace std;


__global__ void reduce0(int *g_idata, int *g_odata){

   extern __shared__ int sdata[];

  unsigned int tid = threadIdx.x;
  unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
  sdata[tid] = g_idata[i];

  __syncthreads();

  for(unsigned int s=1; s < blockDim.x; s *= 2) {
     if (tid % (2*s) == 0) {
        sdata[tid] += sdata[tid + s];
     }
  __syncthreads();
 }
 if (tid == 0) g_odata[blockIdx.x] = sdata[0];

}


int main(void){

  int size = 10;
  thrust::host_vector<int> data_h_i(size, 1);

  //initialize the data, all values will be 1 
  //so the final sum will be equal to 10

  int threadsPerBlock = 256;
  int totalBlocks = size/threadsPerBlock + 1;

  dim3 dimGrid(totalBlocks,1,1);
  dim3 dimBlock(threadsPerBlock, 1, 1);

  thrust::device_vector<int> data_v_i = data_h_i;
  thrust::device_vector<int> data_v_o(size);

  int* output = thrust::raw_pointer_cast(data_v_o.data());
  int* input = thrust::raw_pointer_cast(data_v_i.data());

  reduce0<<<dimGrid, dimBlock>>>(input, output);

  data_v_i.clear();
  data_v_i.shrink_to_fit();

  thrust::host_vector<int> data_h_o = data_v_o;

  data_v_o.clear();
  data_v_o.shrink_to_fit();

  cout<<data_h_o[0]<<endl;


  return 0;

}

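The code is simple: I create a host_vector of length size and initialize every value to 1.

Then I specify 256 threads per block and compute the number of blocks needed for the example dynamically.

To keep things simple, I create an array of only 10 values, which means only one block is needed. A single kernel call should therefore be enough to produce the final result.

My questions are the following:

Question 1

When I compile the example above (nvcc -O3 reduction.cu -arch=sm_21) and run ./a.out, I get the following message: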

terminate called after throwing an instance of 'thrust::system::system_error' what(): unspecified launch failure

I am not sure what is going on here, but the error seems to come from the line

sdata[tid] = g_idata[i]

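The kernel is an exact copy of the one described in the paper, so I am not sure what changes are needed to fix this problem.

Question 2

Once the first problem is fixed, how can I make the code above work for an input array of arbitrary size? For example, if size is greater than 256, at least two blocks are needed, so each block produces an output that must then be combined with the outputs of the other blocks. The paper says the kernel has to be invoked multiple times, but I do not see how to do this dynamically.

Thanks in advance.

EDIT1: For Question 1, it seems I was not allocating the shared memory correctly. Calling the kernel like this: reduce0<<<dimGrid, dimBlock, size*sizeof(int)>>>(input, output); and also checking that tid is not out of range makes the code work properly. The new kernel is as follows: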

__global__ void reduce0(int *g_idata, int *g_odata, int size){

   extern __shared__ int sdata[];

   unsigned int tid = threadIdx.x;
   unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

   if(tid<size){

     sdata[tid] = g_idata[i];
     __syncthreads();

    for(unsigned int s=1; s < size; s *= 2) {
        if (tid % (2*s) == 0) {
         sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
     }

   if (tid == 0) g_odata[blockIdx.x] = sdata[0];

  }

}

However, I am still not sure about Question 2.

score 5 · Accepted Answer

Question 1

Your kernel uses dynamically allocated shared memory:

extern __shared__ int sdata[];
...
sdata[tid] = g_idata[i];

However, you are not allocating any dynamic shared memory in the kernel launch:

reduce0<<<dimGrid, dimBlock>>>(input, output);
                           ^
                           |
                           missing shared memory parameter.

So your kernel fails when it tries to access the shared memory. By the way, you can still do CUDA error checking on your kernel calls (even though you are using Thrust elsewhere).
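As a minimal sketch of both points, the launch below passes the dynamic shared memory size as the third launch parameter and checks for errors afterwards. The helper name launchReduceChecked is hypothetical (it is not in the question), and the kernel assumes a power-of-two block size, as in the paper.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Interleaved-addressing reduction, as in the question, with a bounds check
// so that threads past the end of the input contribute 0 to the sum.
__global__ void reduce0(const int *g_idata, int *g_odata, int size) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < (unsigned int)size) ? g_idata[i] : 0;
    __syncthreads();
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// Hypothetical helper: launches reduce0 with one int of shared memory per
// thread, then reports launch-time and execution-time errors.
cudaError_t launchReduceChecked(const int *d_in, int *d_out, int size,
                                int threadsPerBlock) {
    // The kernel's stride loop assumes a power-of-two block size.
    assert(threadsPerBlock > 0 && (threadsPerBlock & (threadsPerBlock - 1)) == 0);
    int blocks = (size + threadsPerBlock - 1) / threadsPerBlock;
    size_t smemBytes = threadsPerBlock * sizeof(int);  // third launch parameter
    reduce0<<<blocks, threadsPerBlock, smemBytes>>>(d_in, d_out, size);
    cudaError_t err = cudaGetLastError();                   // invalid launch configuration
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // faults during execution
    if (err != cudaSuccess)
        fprintf(stderr, "reduce0 failed: %s\n", cudaGetErrorString(err));
    return err;
}
```

The shared memory size is based on the block size rather than the input size, because each thread in the block owns exactly one element of sdata[]. Without the explicit check, an error such as "unspecified launch failure" only surfaces later, at the next Thrust call that touches the device.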

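Question 2

Question 2 is answered pretty well here in Mark's paper. At the bottom of slide 9 you can see that each block writes its partial result to an array in global memory (g_odata[]) that stores one result per block. We then simply launch another kernel of essentially the same type that operates on g_odata[] instead of the original input data. This process can be repeated until the partial results (e.g. g_odata[]) contain no more than 256 values, or however many threads we launch in a thread block. That final set of results can then be summed by a single thread block to produce a single answer value.

An example is available in the CUDA sample code here.

Here is an edited version of your code that shows how to call two kernels in sequence to handle a larger size. I do not consider this a paragon of reduction programming; it is just a simple extension of what you had already written, to illustrate the concept. Note that I made a variety of changes throughout the kernel and the main code to make the kernel usable with larger data sizes. This method still does not scale past a data size of threadsPerBlock^2 (threadsPerBlock squared), but again, it is only meant to illustrate the concept of calling multiple kernels in sequence to sum the partial results, with the fewest modifications to your code.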

#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <ctime>
#include <sys/time.h>
#include <sstream>
#include <string>
#include <fstream>

using namespace std;


__global__ void reduce0(int *g_idata, int *g_odata, int size){

   extern __shared__ int sdata[];

   unsigned int tid = threadIdx.x;
   unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
   sdata[tid] = 0;
   if(i<size)
     sdata[tid] = g_idata[i];
   __syncthreads();

  for(unsigned int s=1; s < blockDim.x; s *= 2) {
        if (tid % (2*s) == 0) {
         sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
     }

   if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

int main(void){

  int size = 40000;
  thrust::host_vector<int> data_h_i(size, 1);

  //initialize the data, all values will be 1
  //so the final sum will be equal to size

  int threadsPerBlock = 256;
  int totalBlocks = (size+(threadsPerBlock-1))/threadsPerBlock;

  thrust::device_vector<int> data_v_i = data_h_i;
  thrust::device_vector<int> data_v_o(totalBlocks);

  int* output = thrust::raw_pointer_cast(data_v_o.data());
  int* input = thrust::raw_pointer_cast(data_v_i.data());
  reduce0<<<totalBlocks, threadsPerBlock, threadsPerBlock*sizeof(int)>>>(input, output, size);

  reduce0<<<1, threadsPerBlock, threadsPerBlock*sizeof(int)>>>(output, input, totalBlocks);
  data_v_o[0] = data_v_i[0];
  data_v_i.clear();
  data_v_i.shrink_to_fit();

  thrust::host_vector<int> data_h_o = data_v_o;

  data_v_o.clear();
  data_v_o.shrink_to_fit();

  cout<<data_h_o[0]<<endl;


  return 0;

}
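To go past the two-launch limit, the same idea can be wrapped in a loop that keeps launching the kernel until a single value remains. The following is only a sketch under stated assumptions: the names reducePass and reduceSum are hypothetical, the block size is assumed to be a power of two, and the two device buffers are swapped between passes so that a pass never reads and writes the same array.

```cuda
#include <utility>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

// One reduction pass: each block sums up to blockDim.x inputs and writes
// a single partial sum to g_odata[blockIdx.x].
__global__ void reducePass(const int *g_idata, int *g_odata, int size) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < (unsigned int)size) ? g_idata[i] : 0;
    __syncthreads();
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: repeats the pass until one value is left.
int reduceSum(const thrust::host_vector<int> &h_data, int threadsPerBlock = 256) {
    int n = static_cast<int>(h_data.size());
    if (n == 0) return 0;

    thrust::device_vector<int> bufA = h_data;  // input of the first pass
    thrust::device_vector<int> bufB((n + threadsPerBlock - 1) / threadsPerBlock);
    int *in = thrust::raw_pointer_cast(bufA.data());
    int *out = thrust::raw_pointer_cast(bufB.data());
    size_t smemBytes = threadsPerBlock * sizeof(int);

    do {
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        reducePass<<<blocks, threadsPerBlock, smemBytes>>>(in, out, n);
        n = blocks;          // the partial sums become the next input
        std::swap(in, out);  // ping-pong between the two buffers
    } while (n > 1);

    // After the final swap, `in` points at the buffer holding the result.
    int result = 0;
    cudaMemcpy(&result, in, sizeof(int), cudaMemcpyDeviceToHost);
    return result;
}
```

Each pass shrinks the problem by a factor of threadsPerBlock, so an input of one million elements needs three launches with 256 threads per block. All launches go to the default stream, so they execute in order and the final cudaMemcpy waits for the last pass to finish.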
