arrays - CUDA での並列化、各列へのスレッドの割り当て

Question

MxN 2D 行列から変換された 1D 配列があり、各列を並列化し、いくつかの操作を実行したいとします。各列にスレッドを割り当てるにはどうすればよいですか?

たとえば、3x3 マトリックスがある場合:

そして、列番号に応じて列の各数値を追加したい (したがって、1 番目の列は 1 を追加し、2 番目の列は 2 を追加します....)、次のようになります。

1+1   2+1   3+1

4+2   5+2   6+2

7+3   8+3   9+3

CUDAでこれを行うにはどうすればよいですか? 配列内のすべての要素にスレッドを割り当てる方法は知っていますが、各列にスレッドを割り当てる方法はわかりません。したがって、私が望むのは、各列 (1 , 2 ,3 ) ( 4 , 5 ,6 ) (7 , 8 ,9) を送信して操作を行うことです。

score 3 · Accepted Answer

あなたの例では、行に基づいて数字を追加しています。それでも、行列の行/列の長さはわかっています (MxN であることはわかっています)。あなたができることは次のようなものです：

__global__ void MyAddingKernel(int* matrix, int M, int N)
{

    int gid = threadIdx.x + blockDim.x*blockIdx.x;
    //Let's add the row number to each element
    matrix[ gid ] += gid % M;
    //Let's add the column number to each element
    matrix[ gid ] += gid % N;

}

別の番号を追加したい場合は、次のようにすることができます。

matrix[ gid ] += my_col_number_function(gid%N);

score 1 · Accepted Answer

Use a better grid layout to avoid those modulo operations.

Use the unique block index for the rows which is 64-bit range on latest Cuda.

Let the threads iterate in a loop over all elements and add the unique thread index!

Tiling input data is a general approach if calculated data is uniquely across a block (rows), especially for more complex calculations.

/*
 * @param tileCount 
 */
__global__ void addRowNumberToCells(int* inOutMat_g, 
    const unsigned long long int inColumnCount_s, 
    const int inTileCount_s)
{

    //get unique block index
    const unsigned long long int blockId = blockIdx.x //1D
        + blockIdx.y * gridDim.x //2D
        + gridDim.x * gridDim.y * blockIdx.z; //3D

    /* 
     * check column ranges in case kernel is called 
     * with more blocks then columns 
     * (since its block wide following syncthreads are safe)
     */
    if(blockId >= inColumnCount_s)
        return;

    //get unique thread index
    const unsigned long long int threadId = blockId * blockDim.x + threadIdx.x; 

    /*
     * calculate unique and 1 blockId
     * maybe shared memory is overhead 
     * but it shows concept if calculation is more complex
     */
    __shared__ unsigned long long int blockIdAnd1_s;
    if(threadIdx.x == 0)
        blockIdAnd1_s = blockId + 1;
    __sycnthreads();


    unsigned long long int idx;

    //loop over tiles
    for(int i = 0; i < inTileCount_s)
    {
        //calculate new offset for sequence thread writes
        idx = i * blockDim.x + threadIdx.x;
        //check new index range in case column count is no multiple of blockDim.x
        if(idx >= inColumnCount_s)
            break;
        inOutMat_g[idx] = blockIdAnd1_s;
    }

}

Example Cuda 2.0:

mat[131000][1000]

Necessary blockCount = 131000 / 65535 = 2 for blockDim.y rounded up!

inTileCount_s = 1000 / 192 = 6 rounded up!

(192 Threads per Block = 100 occupancy on Cuda 2.0)

<<(65535, 2, 1), (192, 1, 1)>>addRowNumberToCells(mat, 1000, 6)

arrays - CUDA での並列化、各列へのスレッドの割り当て

2 に答える 2

Related

Reference