cuda - Cuda gridDim and blockDim

Question

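I understand what blockDim is, but I have trouble with gridDim. blockDim gives the size of the block, but what is gridDim? On the Internet it says that gridDim.x gives the number of blocks in the x coordinate.

How can I know what blockDim.x * gridDim.x gives?

How can I know how many gridDim.x values there are in the x line?

For example, consider the following code: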

int tid = threadIdx.x + blockIdx.x * blockDim.x;
double temp = a[tid];
tid += blockDim.x * gridDim.x;

while (tid < count)
{
    if (a[tid] > temp)
    {
       temp = a[tid];
    }
    tid += blockDim.x * gridDim.x;
}

I know that tid starts at 0. The code then has tid += blockDim.x * gridDim.x. What is tid after this operation?

score 109 · Accepted Answer

blockDim.x,y,z gives the number of threads in a block, in the particular direction
gridDim.x,y,z gives the number of blocks in a grid, in the particular direction
blockDim.x * gridDim.x gives the number of threads in a grid (in the x direction, in this case)

block and grid variables can be 1, 2, or 3 dimensional. It's common practice when handling 1-D data to only create 1-D blocks and grids.

In the CUDA documentation, these variables are defined here

In particular, when the total threads in the x-dimension (gridDim.x*blockDim.x) is less than the size of the array I wish to process, then it's common practice to create a loop and have the grid of threads move through the entire array. In this case, after processing one loop iteration, each thread must then move to the next unprocessed location, which is given by tid+=blockDim.x*gridDim.x; In effect, the entire grid of threads is jumping through the 1-D array of data, a grid-width at a time. This topic, sometimes called a "grid-striding loop", is further discussed in this blog article.
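To make the loop in the question concrete, here is a minimal host-side sketch in plain C++ that imitates it without a GPU. The function names strided_max and grid_max are made up for this illustration; first_tid stands in for threadIdx.x + blockIdx.x * blockDim.x and stride for blockDim.x * gridDim.x. It assumes, as the question's code does, that every thread's first index is inside the array.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// CPU stand-in for ONE CUDA thread running the grid-stride loop from the
// question. first_tid plays the role of threadIdx.x + blockIdx.x * blockDim.x,
// stride plays the role of blockDim.x * gridDim.x (both names are illustrative).
double strided_max(const std::vector<double>& a, std::size_t first_tid,
                   std::size_t stride) {
    assert(first_tid < a.size() && stride > 0);
    double temp = a[first_tid];
    // Jump through the array one grid-width at a time.
    for (std::size_t tid = first_tid + stride; tid < a.size(); tid += stride) {
        temp = std::max(temp, a[tid]);
    }
    return temp;
}

// Runs every simulated thread in turn and combines the per-thread maxima.
// Assumes block_dim * grid_dim <= a.size(), because the question's code reads
// a[tid] once before any bounds check.
double grid_max(const std::vector<double>& a, std::size_t block_dim,
                std::size_t grid_dim) {
    const std::size_t stride = block_dim * grid_dim;
    assert(stride > 0 && stride <= a.size());
    double best = strided_max(a, 0, stride);
    for (std::size_t t = 1; t < stride; ++t) {
        best = std::max(best, strided_max(a, t, stride));
    }
    return best;
}
```

With a 10-element array and a stride of 4, the simulated thread 0 looks at elements 0, 4 and 8, thread 1 at 1, 5 and 9, and so on. On a real device the per-thread results would still have to be combined, which this sketch does with an ordinary sequential loop.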

You might want to consider taking a couple of the introductory CUDA webinars available on the NVIDIA webinar page. For example, these 2:

GPU Computing using CUDA C – An Introduction (2010) An introduction to the basics of GPU computing using CUDA C. Concepts will be illustrated with walkthroughs of code samples. No prior GPU Computing experience required
GPU Computing using CUDA C – Advanced 1 (2010) First level optimization techniques such as global memory optimization, and processor utilization. Concepts will be illustrated using real code examples

It would be 2 hours well spent, if you want to understand these concepts better.

The general topic of grid-striding loops is covered in some detail here.

score 55 · Accepted Answer

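Paraphrased from the CUDA Programming Guide:

gridDim: This variable contains the dimensions of the grid.

blockIdx: This variable contains the block index within the grid.

blockDim: This variable contains the dimensions of the block.

threadIdx: This variable contains the thread index within the block.

You seem to be a bit confused about the thread hierarchy that CUDA has. In a nutshell, for a kernel there is one grid (which I always visualize as a 3-dimensional cube). Each of its elements is a block, so a grid declared as dim3 grid(10, 10, 2); has 10*10*2 blocks in total. In turn, each block is a 3-dimensional cube of threads.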
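As a rough illustration of the cube picture, the sketch below counts the blocks of the dim3 grid(10, 10, 2) example and flattens a 3-D block index into one number. Dim3, volume and linear_block_index are hypothetical host-side names, not CUDA API, and the x-fastest ordering is just one convention chosen for this example.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's dim3 so the arithmetic runs on the host.
struct Dim3 {
    unsigned x, y, z;
};

// Number of cells in the cube: blocks in a grid, or threads in a block.
unsigned long long volume(Dim3 d) {
    return 1ULL * d.x * d.y * d.z;
}

// Flattens a 3-D block index into a single number in [0, volume(grid)),
// with x varying fastest. This ordering is a choice made for this sketch.
unsigned long long linear_block_index(Dim3 block_idx, Dim3 grid) {
    assert(block_idx.x < grid.x && block_idx.y < grid.y && block_idx.z < grid.z);
    return block_idx.x + 1ULL * grid.x * (block_idx.y + 1ULL * grid.y * block_idx.z);
}
```

For the 10 x 10 x 2 grid, volume gives 200 blocks, and the block at (3, 2, 1) gets linear index 3 + 10 * (2 + 10 * 1) = 123.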

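That said, it is common to use only the x-dimension of the blocks and grids, which is what the code in your question appears to be doing. This is especially relevant if you are working with 1D arrays. In that case, threadIdx.x + blockIdx.x * blockDim.x is in effect the unique index of each thread within your grid, and the tid += blockDim.x * gridDim.x line advances that index by the total number of threads in the grid. This is because blockDim.x is the size of each block and gridDim.x is the total number of blocks.

So if you launch a kernel with the parameters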

dim3 block_dim(128,1,1);
dim3 grid_dim(10,1,1);
kernel<<<grid_dim,block_dim>>>(...);

then, if your kernel contains threadIdx.x + blockIdx.x*blockDim.x, you effectively have:

threadIdx.x ranges over [0, 128)

blockIdx.x ranges over [0, 10)

blockDim.x is equal to 128

gridDim.x is equal to 10

Hence, calculating threadIdx.x + blockIdx.x*blockDim.x gives values within the range defined by [0, 128) + 128 * [0, 10). This means your tid values range over {0, 1, 2, ..., 1279}. This is useful when you want to map threads to tasks, as it provides a unique identifier for every thread in your kernel.
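The uniqueness claim can be checked on the host by enumerating every (blockIdx.x, threadIdx.x) pair. This is only a sketch in plain C++; global_thread_ids and is_contiguous_from_zero are names invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side enumeration of threadIdx.x + blockIdx.x * blockDim.x for every
// (block, thread) pair of a 1-D launch, block by block.
std::vector<int> global_thread_ids(int grid_dim_x, int block_dim_x) {
    assert(grid_dim_x > 0 && block_dim_x > 0);
    std::vector<int> ids;
    ids.reserve(static_cast<std::size_t>(grid_dim_x) * block_dim_x);
    for (int block_idx = 0; block_idx < grid_dim_x; ++block_idx) {
        for (int thread_idx = 0; thread_idx < block_dim_x; ++thread_idx) {
            ids.push_back(thread_idx + block_idx * block_dim_x);
        }
    }
    return ids;
}

// True when ids is exactly 0, 1, 2, ..., ids.size() - 1: no gaps, no repeats.
bool is_contiguous_from_zero(const std::vector<int>& ids) {
    for (std::size_t i = 0; i < ids.size(); ++i) {
        if (ids[i] != static_cast<int>(i)) return false;
    }
    return true;
}
```

For the launch with 10 blocks of 128 threads, this produces 1280 ids running from 0 to 1279, each exactly once.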

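However, if you have

int tid = threadIdx.x + blockIdx.x * blockDim.x;
tid += blockDim.x * gridDim.x;

then essentially tid = [0, 128) + 128 * [0, 10) + (128 * 10), and your tid values range over {1280, 1281, ..., 2559}. I'm not sure where that would be relevant on its own, but it all depends on your application and how you map your threads to your data. This mapping is central to any kernel launch, and you are the one who determines how it should be done. When you launch your kernel you specify the grid and block dimensions, and you are the one who has to enforce the mapping to your data inside the kernel. You are free to choose that mapping as long as you do not exceed the hardware limits (on devices of compute capability 3.0 and later, at most 2^10 threads per block, and at most 2^31 - 1 blocks in the x-dimension of the grid and 2^16 - 1 blocks in y and z).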
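The shifted range becomes relevant once the array is longer than the grid. As a small host-side sketch (elements_visited is a hypothetical helper, and the array length of 2000 used below is an assumed example value), the number of elements one thread touches in a loop that steps by blockDim.x * gridDim.x can be computed directly:

```cpp
#include <cassert>

// Number of elements one thread touches when it visits
// tid, tid + stride, tid + 2 * stride, ... for as long as the index < count.
// tid is the thread's starting global index; stride is blockDim.x * gridDim.x.
int elements_visited(int tid, int stride, int count) {
    assert(tid >= 0 && stride > 0 && count >= 0);
    if (tid >= count) return 0;                  // thread has nothing to do
    return (count - tid + stride - 1) / stride;  // ceil((count - tid) / stride)
}
```

With the 128 x 10 launch the stride is 1280. For an array of 2000 elements, threads 0 through 719 visit two elements each (the second one lies in 1280..1999), while threads 720 through 1279 visit only one.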

score 1 · Accepted Answer

In this source code there are only 4 threads, yet the kernel function can access all 10 elements of the array. How?

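#include <stdio.h>

#define N 10 //(33*1024)

__global__ void add(int *c){
    // Global thread index: position in the block plus the block's offset.
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if(tid < N)
        c[tid] = 1;

    // Advance by the total number of threads in the grid until past the end.
    while( tid < N)
    {
        c[tid] = 1;
        tid += blockDim.x * gridDim.x;
    }
}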

int main(void)
{
    int c[N];
    int *dev_c;
    cudaMalloc( (void**)&dev_c, N*sizeof(int) );

    for(int i=0; i<N; ++i)
    {
        c[i] = -1;
    }

    cudaMemcpy(dev_c, c, N*sizeof(int), cudaMemcpyHostToDevice);

    add<<< 2, 2>>>(dev_c);
    cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost );

    for(int i=0; i< N; ++i)
    {
        printf("c[%d] = %d \n" ,i, c[i] );
    }

    cudaFree( dev_c );
}

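Why not create 10 threads, e.g. add<<<2,5>>> or add<<<5,2>>>? Because when N is much larger than 10, e.g. 33*1024, you have to create a reasonably small number of threads.

This source code is an example of that case. The array has 10 elements and there are 4 CUDA threads. How can all 10 elements be accessed by only 4 threads?

See the page about the meaning of threadIdx, blockIdx, blockDim and gridDim in the CUDA documentation.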

In this source code,

gridDim.x : 2    the number of blocks in x

gridDim.y : 1    the number of blocks in y

blockDim.x : 2   the number of threads in x within a block

blockDim.y : 1   the number of threads in y within a block

The number of threads is 4, because 2*2 (blocks * threads per block).

In the add kernel function, the threads start at indices 0, 1, 2 and 3:

->tid = threadIdx.x + blockIdx.x * blockDim.x

①0+0*2=0

②1+0*2=1

③0+1*2=2

④1+1*2=3

How are the remaining indices 4, 5, 6, 7, 8, 9 accessed? There is a calculation inside the while loop:

tid += blockDim.x * gridDim.x in while

** First thread (tid = 0) **

- loop 1: 0+2*2=4

- loop 2: 4+2*2=8

- loop 3: 8+2*2=12 (but tid < N is now false, so the loop exits)

** Second thread (tid = 1) **

- loop 1: 1+2*2=5

- loop 2: 5+2*2=9

- loop 3: 9+2*2=13 (but tid < N is now false, so the loop exits)

** Third thread (tid = 2) **

- loop 1: 2+2*2=6

- loop 2: 6+2*2=10 (but tid < N is now false, so the loop exits)

** Fourth thread (tid = 3) **

- loop 1: 3+2*2=7

- loop 2: 7+2*2=11 (but tid < N is now false, so the loop exits)

So all of the indices 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 are reached through the tid values.
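The walkthrough above can be replayed on the host. The following plain C++ sketch uses invented helper names (visited_by_thread, visit_counts) and only imitates the index arithmetic of the while loop; it does not launch anything on a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Indices visited by the simulated thread whose first index is first_tid,
// stepping by stride = blockDim.x * gridDim.x until the index reaches n.
std::vector<int> visited_by_thread(int first_tid, int stride, int n) {
    assert(first_tid >= 0 && stride > 0);
    std::vector<int> visited;
    for (int tid = first_tid; tid < n; tid += stride) {
        visited.push_back(tid);
    }
    return visited;
}

// For each of the n elements, counts how many simulated threads visit it
// in the while loop when grid_dim blocks of block_dim threads are used.
std::vector<int> visit_counts(int grid_dim, int block_dim, int n) {
    assert(grid_dim > 0 && block_dim > 0 && n >= 0);
    const int stride = block_dim * grid_dim;
    std::vector<int> counts(static_cast<std::size_t>(n), 0);
    for (int block_idx = 0; block_idx < grid_dim; ++block_idx) {
        for (int thread_idx = 0; thread_idx < block_dim; ++thread_idx) {
            const int first_tid = thread_idx + block_idx * block_dim;
            for (int i : visited_by_thread(first_tid, stride, n)) {
                ++counts[static_cast<std::size_t>(i)];
            }
        }
    }
    return counts;
}
```

For 2 blocks of 2 threads and N = 10, the four simulated threads visit {0, 4, 8}, {1, 5, 9}, {2, 6} and {3, 7}, so every element is visited exactly once.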

Please refer to this page: http://study.marearts.com/2015/03/to-process-all-arrays-by-reasonably.html I cannot upload the image because my reputation is too low.
