cuda - Problem with data in CUDA constant memory

Question

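I have a CUDA application in which I am trying to use constant memory. However, the data in constant memory is visible inside the kernel only when the kernel is written in the same file as the main function. Otherwise, if the kernel function is defined in another file, the constant memory reads as 0 and the operation does not work correctly. I am providing a simple dummy code to explain the problem more easily. In this program there is a 48x48 matrix divided into 16x16 blocks, storing random numbers from 1 to 50. Inside the kernel, I add the numbers stored in constant memory to each row within a block. The code is as follows.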

Header file:

#include <windows.h>
#include <dos.h>
#include <stdio.h>
#include <conio.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#include <cuda.h>
#include <cuda_runtime.h>
#include <cutil.h>
#include <curand.h>
#include <curand_kernel.h>

__constant__ int test_cons[16];

__global__ void test_kernel_1(int *,int *);

Main program:

int main(int argc,char *argv[])
{   int *mat,*dev_mat,*res,*dev_res;
    int i,j;
    int test[16 ]   = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
    cudaMemcpyToSymbol(test_cons,test,16*sizeof(int));
    mat = (int *)malloc(48*48*sizeof(int));
    res = (int *)malloc(48*48*sizeof(int));
    memset(res,0,48*48*sizeof(int));

    srand(time(NULL));
    for(i=0;i<48;i++)
    {   for(j=0;j<48;j++)
        {   mat[i*48+j] = rand()%50+1;  /* random value in 1..50 inclusive */
            printf("%d\t",mat[i*48+j] );
        }
        printf("\n");
    }

    cudaMalloc((void **)&dev_mat,48*48*sizeof(int));
    cudaMemcpy(dev_mat,mat,48*48*sizeof(int),cudaMemcpyHostToDevice);
    cudaMalloc((void **)&dev_res,48*48*sizeof(int));

    dim3 gridDim(48/16,48/16,1);
    dim3 blockDim(16,16,1);

    test_kernel_1<<< gridDim,blockDim>>>(dev_mat,dev_res);

    cudaMemcpy(res,dev_res,48*48*sizeof(int),cudaMemcpyDeviceToHost);

    printf("\n\n\n\n");
    for(i=0;i<48;i++)
    {   for(j=0;j<48;j++)
        {   printf("%d\t",res[i*48+j] );
        }
        printf("\n");
    }

    cudaFree(dev_mat);
    cudaFree(dev_res);
    free(mat);
    free(res);
    exit(0);
}

Kernel function:

__global__ void test_kernel_1(int *dev_mat,int* dev_res)
{
    int row = blockIdx.y*blockDim.y+threadIdx.y;
    int col = blockIdx.x*blockDim.x +threadIdx.x;

    dev_res[row*48+col] = dev_mat[row*48+col] + test_cons[threadIdx.x];
}

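When the kernel function is defined in the main program file together with the main program, the constant memory values are correct. Otherwise, when it is in a different file, the value of test_cons[threadIdx.x] becomes 0.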

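I came across this link, which discusses the same problem, but I could not fully understand it. It would be very helpful if someone could tell me why this is happening and what I need to do to avoid the problem. Any kind of help is greatly appreciated. Thanks.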

score 2 · Accepted Answer

I recently answered a similar question here.

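CUDA can handle code that references device code (entry points) or device symbols in other files, but this requires separate compilation with device linking (as described and linked in the answer I linked above). (Also, separate compilation/linking requires CC 2.0 or greater.)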

So if you modify your link steps, you can keep the __constant__ variable in one file and reference it from a different file.
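A minimal sketch of what that multi-file layout could look like, assuming relocatable device code is enabled with nvcc's `-rdc=true` option. The file names `test_cons.cuh`, `kernel.cu`, `main.cu` and the output name `app` are hypothetical, and the three files are shown in one listing separated by comments. The key change from the question's header is that the header only declares the symbol with `extern`, while exactly one .cu file defines it.

```cuda
// ---- test_cons.cuh (hypothetical shared header): declarations only ----
#pragma once
extern __constant__ int test_cons[16];                  // declared here, defined in kernel.cu
__global__ void test_kernel_1(int *dev_mat, int *dev_res);

// ---- kernel.cu (hypothetical): the single definition of the symbol, plus the kernel ----
#include "test_cons.cuh"

__constant__ int test_cons[16];                         // the one and only definition

__global__ void test_kernel_1(int *dev_mat, int *dev_res)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    dev_res[row * 48 + col] = dev_mat[row * 48 + col] + test_cons[threadIdx.x];
}

// ---- main.cu (hypothetical): host code that writes the symbol defined in kernel.cu ----
#include <cuda_runtime.h>
#include "test_cons.cuh"

int main()
{
    int test[16] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
    // With device linking, this refers to the same symbol the kernel reads.
    cudaMemcpyToSymbol(test_cons, test, sizeof(test));
    // ... allocate dev_mat / dev_res and launch test_kernel_1 as in the question ...
    return 0;
}

// Build with separate compilation and device linking, for example:
//   nvcc -rdc=true main.cu kernel.cu -o app
```

If the header kept the plain `__constant__ int test_cons[16];` definition from the question, each file that includes it would contain its own definition, so the `extern` declaration in the header is the part that makes a single shared symbol possible here.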

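If not (if you do not specify separate compilation and device linking), then the device code that references the __constant__ variable, the host code that references the __constant__ variable, and the definition/declaration of the variable itself all need to be in the same file.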

So this:

__constant__ int test_cons[16];

this:

cudaMemcpyToSymbol(test_cons,test,16*sizeof(int));

and this:

dev_res[row*48+col] = dev_mat[row*48+col] + test_cons[threadIdx.x];

all need to be in the same file.
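As an illustration of the same-file arrangement, here is a minimal single-file sketch based on the question's code. The file name `single_file.cu`, the `check` helper, the `N` and `TILE` macros, and the deterministic input data are assumptions added for the example, not part of the original program. It prints a mismatch count, which should be 0 if the constant data reached the kernel.

```cuda
// single_file.cu (hypothetical name); build with: nvcc single_file.cu -o single_file
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N 48     // matrix is N x N
#define TILE 16  // each block is TILE x TILE

// 1) Definition of the constant array, in the same file as its users.
__constant__ int test_cons[TILE];

// 2) Device code that reads the constant array.
__global__ void test_kernel_1(const int *dev_mat, int *dev_res)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    dev_res[row * N + col] = dev_mat[row * N + col] + test_cons[threadIdx.x];
}

// Hypothetical helper: stop with a message if a runtime call failed.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main()
{
    int test[TILE];
    for (int i = 0; i < TILE; i++) test[i] = i + 1;

    // 3) Host code that writes the constant array.
    check(cudaMemcpyToSymbol(test_cons, test, sizeof(test)), "cudaMemcpyToSymbol");

    size_t bytes = N * N * sizeof(int);
    int *mat = (int *)malloc(bytes);
    int *res = (int *)malloc(bytes);
    if (mat == NULL || res == NULL) { fprintf(stderr, "malloc failed\n"); return EXIT_FAILURE; }
    for (int i = 0; i < N * N; i++) mat[i] = i % 50;  // deterministic stand-in for rand()

    int *dev_mat = NULL, *dev_res = NULL;
    check(cudaMalloc((void **)&dev_mat, bytes), "cudaMalloc dev_mat");
    check(cudaMalloc((void **)&dev_res, bytes), "cudaMalloc dev_res");
    check(cudaMemcpy(dev_mat, mat, bytes, cudaMemcpyHostToDevice), "cudaMemcpy H2D");

    dim3 grid(N / TILE, N / TILE, 1);
    dim3 block(TILE, TILE, 1);
    test_kernel_1<<<grid, block>>>(dev_mat, dev_res);
    check(cudaGetLastError(), "kernel launch");
    check(cudaMemcpy(res, dev_res, bytes, cudaMemcpyDeviceToHost), "cudaMemcpy D2H");

    // threadIdx.x equals col % TILE because blockDim.x == TILE.
    int mismatches = 0;
    for (int row = 0; row < N; row++)
        for (int col = 0; col < N; col++)
            if (res[row * N + col] != mat[row * N + col] + test[col % TILE]) mismatches++;
    printf("mismatches: %d\n", mismatches);

    cudaFree(dev_mat);
    cudaFree(dev_res);
    free(mat);
    free(res);
    return mismatches == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

Checking the return value of `cudaMemcpyToSymbol` and the kernel launch is optional, but it makes it easier to tell a failed copy apart from a symbol that was simply never written.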
