cuda - Concurrent computation and data loading into shared memory: the case of tiled matrix-matrix multiplication

Question

Based on the CUDA shared memory example, I want to write a matrix multiplication algorithm that performs computation and data loading concurrently. I have the following code:

float As[BLOCK_SIZE][BLOCK_SIZE];
float Bs[BLOCK_SIZE][BLOCK_SIZE];
As[ty][tx] = A[aBegin + wA * ty + tx];
Bs[ty][tx] = B[bBegin + wB * ty + tx];
for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep)
{
    __shared__ float A2s[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float B2s[BLOCK_SIZE][BLOCK_SIZE];
    A2s[ty][tx] = As[ty][tx];
    B2s[ty][tx] = Bs[ty][tx];
    __syncthreads();
    if (a+1 <= aEnd)
    {
        As[ty][tx] = A[a+1 + wA * ty + tx];
        Bs[ty][tx] = B[b+1 + wB * ty + tx]; 
    }
#pragma unroll
    for (int k = 0; k < BLOCK_SIZE; ++k)
    {
         Csub += A2s[ty][k] * B2s[k][tx];
    }   
    __syncthreads();
}

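However, this is slower than the original solution, because the second data load executes sequentially with the computation. How can I make the two run in parallel?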

score 1 · Accepted Answer

Avoid moving the A and B data into the local arrays As and Bs, that is,

As[ty][tx] = A[aBegin + wA * ty + tx];
Bs[ty][tx] = B[bBegin + wB * ty + tx];

You can move them directly into the shared memory arrays A2s and B2s, that is,

A2s[ty][tx] = A[aBegin + wA * ty + tx];
B2s[ty][tx] = B[bBegin + wB * ty + tx];

Also, the data loaded by

As[ty][tx] = A[a+1 + wA * ty + tx];
Bs[ty][tx] = B[b+1 + wB * ty + tx];

appears to be unused.

Finally, the shared memory array declarations need to move outside the for loop, and the final assignment to the output matrix was also missing.

Try something like this:

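__global__ void TiledMatrixMultiplicationKernel(float* A, float* B, float* C, int Width)
{
    // Shared tiles are declared once, outside the loop over tiles.
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
    int bx = blockIdx.x; int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;
    int Row = by * BLOCK_SIZE + ty;
    int Col = bx * BLOCK_SIZE + tx;
    float Csub = 0;
    // Assumes Width is a multiple of BLOCK_SIZE.
    for (int m = 0; m < Width/BLOCK_SIZE; ++m) {
        // Load one tile of A and one tile of B directly into shared memory.
        As[ty][tx] = A[Row*Width + (m*BLOCK_SIZE + tx)];
        Bs[ty][tx] = B[Col + (m*BLOCK_SIZE + ty)*Width];
        __syncthreads();  // both tiles are fully loaded before any thread reads them
        for (int k = 0; k < BLOCK_SIZE; ++k) {
            Csub += As[ty][k] * Bs[k][tx];
        }
        __syncthreads();  // all threads are done reading before the next load overwrites the tiles
    }
    // Final assignment to the output matrix.
    C[Row*Width+Col] = Csub;
}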
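If the goal is still to overlap the next tile's global loads with the current tile's arithmetic, one common pattern is to prefetch a single element per thread into a register rather than into a per-thread BLOCK_SIZE x BLOCK_SIZE array. The following is only a sketch of that idea, not the method from the answer above. The kernel name TiledMatMulPrefetch, the helper maxAbsError, and BLOCK_SIZE = 16 are assumptions made for illustration, and the matrices are assumed square with Width a multiple of BLOCK_SIZE. Whether this is actually faster depends on the GPU and on how the compiler already schedules the loads, so it should be measured.

```cuda
#include <cassert>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 16  // assumed tile size for this sketch

// Hypothetical kernel: each thread prefetches ONE element of the next A tile
// and ONE element of the next B tile into registers while the current tiles
// are being multiplied out of shared memory.
__global__ void TiledMatMulPrefetch(const float* A, const float* B, float* C, int Width)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = blockIdx.y * BLOCK_SIZE + ty;
    int Col = blockIdx.x * BLOCK_SIZE + tx;
    int numTiles = Width / BLOCK_SIZE;

    // Fetch this thread's elements of tile 0 into registers.
    float aReg = A[Row * Width + tx];
    float bReg = B[ty * Width + Col];
    float Csub = 0.0f;

    for (int m = 0; m < numTiles; ++m) {
        // Publish the prefetched elements to shared memory.
        As[ty][tx] = aReg;
        Bs[ty][tx] = bReg;
        __syncthreads();

        // Issue the global loads for tile m+1. Their results are not needed
        // until the next iteration, so they may overlap with the loop below.
        if (m + 1 < numTiles) {
            aReg = A[Row * Width + (m + 1) * BLOCK_SIZE + tx];
            bReg = B[((m + 1) * BLOCK_SIZE + ty) * Width + Col];
        }

        #pragma unroll
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        __syncthreads();  // all reads finish before the tiles are overwritten
    }
    C[Row * Width + Col] = Csub;
}

// Hypothetical helper: runs the kernel on n x n inputs and returns the largest
// absolute difference from a straightforward CPU reference.
float maxAbsError(int n)
{
    assert(n % BLOCK_SIZE == 0);
    size_t bytes = size_t(n) * n * sizeof(float);
    std::vector<float> hA(n * n), hB(n * n), hC(n * n);
    for (int i = 0; i < n * n; ++i) {
        hA[i] = float(i % 7) * 0.5f;   // small values, exactly representable
        hB[i] = float(i % 5) - 2.0f;
    }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(n / BLOCK_SIZE, n / BLOCK_SIZE);
    TiledMatMulPrefetch<<<grid, block>>>(dA, dB, dC, n);
    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);  // also waits for the kernel
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    float maxErr = 0.0f;
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < n; ++k) ref += hA[r * n + k] * hB[k * n + c];
            maxErr = fmaxf(maxErr, fabsf(ref - hC[r * n + c]));
        }
    return maxErr;
}
```

Calling maxAbsError(64) from a host main is one way to confirm the kernel agrees with the CPU result before timing it. Compared with the code in the question, the prefetch here costs two registers per thread instead of two full per-thread arrays, and the next tile is addressed by advancing a whole tile (m + 1) rather than by a + 1.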
