c - CUDA カーネル起動パラメーターの説明は正しいですか?

Question

ここで、いくつかの疑似コードを使用して CUDA 起動パラメーターモデル (または実行構成モデル) を自己説明しようとしましたが、大きな間違いがあったかどうかはわかりません。ありがとうございます。

ここにあります：

/*
  normally, we write kernel function like this.
  note, __global__ means this function will be called from host codes,
  and executed on device. and a __global__ function could only return void.
  if there's any parameter passed into __global__ function, it should be stored
  in shared memory on device. so, kernel function is so different from the *normal*
  C/C++ functions. if I was the CUDA authore, I should make the kernel function more
  different  from a normal C function.
*/

__global__ void
kernel(float *arr_on_device, int n) {
        int idx = blockIdx.x * blockDIm.x + threadIdx.x;
        if (idx < n) {
                arr_on_device[idx] = arr_on_device[idx] * arr_on_device[idx];
        }
}

/*
  after this definition, we could call this kernel function in our normal C/C++ codes !!
  do you feel something wired ? un-consistant ?
  normally, when I write C codes, I will think a lot about the execution process down to
  the metal in my mind, and this one...it's like some fragile codes. break the sequential
  thinking process in my mind.
  in order to make things normal, I found a way to explain: I expand the *__global__ * function
  to some pseudo codes:
*/

#define __foreach(var, start, end) for (var = start, var < end; ++var)

__device__ int
__indexing() {
        const int blockId = blockIdx.x * gridDim.x + gridDim.x * gridDim.y * blockIdx.z;

        return 
                blockId * (blockDim.x * blockDim.y * blockDim.z) +
                threadIdx.z * (blockDim.x * blockDim.y) +
                threadIdx.x;
}

global_config =:
        {
                /*
                  global configuration.
                  note the default values are all 1, so in the kernel codes,
                  we could just ignore those dimensions.
                 */ 
                gridDim.x = gridDim.y = gridDim.z = 1;
                blockDim.x = blockDim.y = blockDim.z = 1;
        };

kernel =:
        {
                /*
                  I thought CUDA did some bad evil-detail-covering things here.
                  it's said that CUDA C is an extension of C, but in my mind,
                  CUDA C is more like C++, and the *<<<>>>* part is too tricky.
                  for example:
                  kernel<<<10, 32>>>(); means kernel will execute in 10 blocks each have 32 threads.

                  dim3 dimG(10, 1, 1);
                  dim3 dimB(32, 1, 1);
                  kernel<<<dimG, dimB>>>(); this is exactly the same thing with above.

                  it's not C style, and C++ style ? at first, I thought this could be done by
                  C++'s constructor stuff, but I checked structure *dim3*, there's no proper
                  constructor for this. this just brroke the semantics of both C and C++. I thought
                  force user to use *kernel<<<dim3, dim3>>>* would be better. So I'd like to keep
                  this rule in my future codes.
                */

                gridDim  = dimG;
                blockDim = dimB;

                __foreach(blockIdx.z,  0, gridDim.z)
                __foreach(blockIdx.y,  0, gridDim.y)
                __foreach(blockIdx.x,  0, gridDim.x)
                __foreach(threadIdx.z, 0, blockDim.z)
                __foreach(threadIdx.y, 0, blockDim.y)
                __foreach(threadIdx.x, 0, blockDim.x)
                {
                        const int idx = __indexing();        
                        if (idx < n) {
                                arr_on_device[idx] = arr_on_device[idx] * arr_on_device[idx];
                        }
                }
        };

/*
  so, for me, gridDim & blockDim is like some boundaries.
  e.g. gridDim.x is the upper bound of blockIdx.x, this is not that obvious for people like me.
 */

/* the declaration of dim3 from vector_types.h of CUDA/include */
struct __device_builtin__ dim3
{
        unsigned int x, y, z;
#if defined(__cplusplus)
        __host__ __device__ dim3(unsigned int vx = 1, unsigned int vy = 1, unsigned int vz = 1) : x(vx), y(vy), z(vz) {}
        __host__ __device__ dim3(uint3 v) : x(v.x), y(v.y), z(v.z) {}
        __host__ __device__ operator uint3(void) { uint3 t; t.x = x; t.y = y; t.z = z; return t; }
#endif /* __cplusplus */
};

typedef __device_builtin__ struct dim3 dim3;

score 12 · Accepted Answer

CUDA ドライバー API

CUDA Driver API v4.0 以降では、次の関数を使用してカーネルの起動を制御します。

cuFuncSetCacheConfig
cuFuncSetSharedMemConfig
cuLaunchKernel

次の CUDA ドライバー API 関数は、v4.0 で cuLaunchKernel が導入される前に使用されていました。

cuFuncSetBlockShape()
cuFuncSetSharedSize()
cuParamSet{Size,i,fv}()
cuLaunch
cuLaunchGrid

これらの関数に関する追加情報は cuda.h にあります。

CUresult CUDAAPI cuLaunchKernel(CUfunction f,
    unsigned int gridDimX,
    unsigned int gridDimY,
    unsigned int gridDimZ,
    unsigned int blockDimX,
    unsigned int blockDimY,
    unsigned int blockDimZ,
    unsigned int sharedMemBytes,
    CUstream hStream,
    void **kernelParams,
    void **extra);

cuLaunchKernel は、起動構成全体をパラメーターとして受け取ります。

詳細については、NVIDIA ドライバー API[実行制御] 1を参照してください。

CUDA カーネルの起動

cuLaunchKernel は 1. 起動パラメータを確認します 2. 共有メモリ構成を変更します 3. ローカルメモリ割り当てを変更します 4. ストリーム同期トークンをコマンドバッファにプッシュして、ストリーム内の 2 つのコマンドが重複しないようにします 4. 起動パラメータをプッシュします5. 起動コマンドをコマンドバッファーにプッシュします。 6. コマンドバッファーをデバイスに送信します (wddm ドライバーでは、この手順は延期される場合があります)。 7. wddm では、カーネルドライバーはデバイスメモリに必要なすべてのメモリをページします。

GPU は 1. コマンドを検証します 2. コマンドをコンピューティングワークディストリビューターに送信します 3. 起動構成とスレッドブロックを SM にディスパッチします

すべてのスレッドブロックが完了すると、ワークディストリビューターはキャッシュをフラッシュして CUDA メモリモデルを尊重し、カーネルを完了としてマークして、ストリーム内の次のアイテムが前進できるようにします。

スレッドブロックがディスパッチされる順序は、アーキテクチャ間で異なります。

コンピューティング機能 1.x デバイスは、カーネルパラメーターを共有メモリに保存します。コンピューティング機能 2.0 ～ 3.5 デバイスは、kenrel パラメーターを定数メモリに保存します。

CUDA ランタイム API

CUDA ランタイムは C++ ソフトウェアライブラリであり、CUDA ドライバー API の上にツールチェーンを構築します。CUDA ランタイムは、次の関数を使用してカーネルの起動を制御します。

cudaConfigureCall cudaFuncSetCacheConfig cudaFuncSetSharedMemConfig cudaLaunch cudaSetupArgument

NVIDIA ランタイム API[実行制御] 2を参照してください。

<<<>>> CUDA 言語拡張は、カーネルを起動するために使用される最も一般的な方法です。

コンパイル中、nvcc は <<<>>> を使用して呼び出されたカーネル関数ごとに新しい CPU スタブ関数を作成し、<<<>>> をスタブ関数の呼び出しに置き換えます。

例えば

__global__ void kernel(float* buf, int j)
{
    // ...
}

kernel<<<blocks,threads,0,myStream>>>(d_buf,j);

生成する

void __device_stub__Z6kernelPfi(float *__par0, int __par1){__cudaSetupArgSimple(__par0, 0U);__cudaSetupArgSimple(__par1, 4U);__cudaLaunch(((char *)((void ( *)(float *, int))kernel)));}

--keep を nvcc コマンドラインに追加すると、生成されたファイルを検査できます。

cudaLaunch は cuLaunchKernel を呼び出します。

CUDA 動的並列処理

CUDA CDP は、上記の CUDA ランタイム API と同様に機能します。

score 3 · Accepted Answer

を使用<<<...>>>することで、GPU で多数のスレッドを起動しています。これらのスレッドはブロックにグループ化され、大きなグリッドを形成します。すべてのスレッドが、呼び出されたカーネル関数コードを実行します。

カーネル関数では、threadIdxやblockIdxenable などの組み込み変数が、コードが実行するスレッドを認識し、作業のスケジュールされた部分を実行します。

編集

基本的に<<<...>>>、カーネルを起動するための構成手順を簡素化します。これを使用しない場合、C99 構文のみを使用する OpenCL の方法と同様に、1 回のカーネル起動で 4 ～ 5 個の API を呼び出さなければならない場合があります。

実際、CUDA ドライバー API を確認できます。これらのすべての API を提供する場合があるため、を使用する必要はありません<<<>>>。

score 1 · Accepted Answer

基本的に、GPU は個別の「デバイス」GPU (たとえば、GeForce 690 には 2 つ) -> 複数の SM (ストリーミングマルチプロセッサ) -> 複数の CUDA コアに分割されます。私の知る限り、ブロックまたはグリッドの次元は、ハードウェアとは関係のない単なる論理割り当てですが、ブロックの合計サイズ(x*y*z) は非常に重要です。

ブロック内のスレッドは、共有メモリと同期の機能を使用するために、同じ SM 上にある必要があります。したがって、SM に含まれる CUDA コアよりも多くのスレッドを持つブロックを持つことはできません。

それぞれ 32 個の CUDA コアを持つ 16 個の SM があり、ブロックサイズが 31x1x1、グリッドサイズが 20x1x1 であるという単純なシナリオがある場合、カードの処理能力の少なくとも 1/32 が失われます。ブロックが実行されるたびに、SM は 32 のコアのうち 31 のみがビジーになります。ブロックがロードされて SM がいっぱいになり、16 個のブロックがほぼ同時に終了し、最初の 4 個の SM が解放されると、最後の 4 ブロックの処理が開始されます (ブロック #17-20 である必要はありません)。

コメントと修正は大歓迎です。

c - CUDA カーネル起動パラメーターの説明は正しいですか?

3 に答える 3

編集

Related

Reference