c++ - CUDA does not give the correct answer when the array size is larger than 1,000,000

Question

I wrote a simple sum reduction that seems to work fine until I increase the array size to 1 million. What could be the problem?

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define BLOCK_SIZE 128
#define ARRAY_SIZE 10000

cudaError_t addWithCuda(const long *input, long *output, int totalBlocks, size_t size);

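__global__ void sumKernel(const long *input, long *output)
{
    int tid = threadIdx.x;              // index of this thread within its block
    int bid = blockDim.x * blockIdx.x;  // index of the block's first element

    __shared__ long data[BLOCK_SIZE];

    // Load one element per thread; pad past the end of the array with 0.
    if (bid + tid < ARRAY_SIZE)
        data[tid] = input[bid + tid];
    else
        data[tid] = 0;

    __syncthreads();

    // Tree reduction in shared memory: halve the active range each pass.
    for (int i = BLOCK_SIZE / 2; i >= 1; i >>= 1)
    {
        if (tid < i)
            data[tid] += data[tid + i];
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum.
    if (tid == 0)
        output[blockIdx.x] = data[0];
}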

int main()
{
    int totalBlocks = ARRAY_SIZE / BLOCK_SIZE;

    if (ARRAY_SIZE % BLOCK_SIZE != 0)
        totalBlocks++;

    long *input = (long*) malloc(ARRAY_SIZE * sizeof(long));
    long *output = (long*) malloc(totalBlocks * sizeof(long));

    for (int i = 0; i < ARRAY_SIZE; i++)
    {
        input[i] = i + 1;
    }

    // Add vectors in parallel.
    cudaError_t cudaStatus = addWithCuda(input, output, totalBlocks, ARRAY_SIZE);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addWithCuda failed!");
        return 1;
    }

    long ans = 0;
    for (int i = 0; i < totalBlocks; i++)
    {
        ans = ans + output[i];
    }

    printf("Final Ans : %ld", ans);

    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }

    getchar();

    return 0;
}

// Helper function for using CUDA to add vectors in parallel.
cudaError_t addWithCuda(const long *input, long *output, int totalBlocks, size_t size)
{
    long *dev_input = 0;
    long *dev_output = 0;

    cudaError_t cudaStatus;

    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
        goto Error;
    }

    // Allocate GPU buffers for two vectors (one input, one output).
    cudaStatus = cudaMalloc((void**)&dev_input, size * sizeof(long));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    cudaStatus = cudaMalloc((void**)&dev_output, totalBlocks * sizeof(long));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    // Copy input vectors from host memory to GPU buffers.
    cudaStatus = cudaMemcpy(dev_input, input, size * sizeof(long), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    cudaStatus = cudaMemcpy(dev_output, output, (totalBlocks) * sizeof(long), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    // Launch a kernel on the GPU with one thread for each element.
    sumKernel<<<totalBlocks, BLOCK_SIZE>>>(dev_input, dev_output);

    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
        goto Error;
    }

    // Copy output vector from GPU buffer to host memory.
    cudaStatus = cudaMemcpy(output, dev_output, totalBlocks * sizeof(long), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

Error:
    cudaFree(dev_input);
    cudaFree(dev_output);

    return cudaStatus;
}
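
One way to cross-check the per-block results of the kernel above is a host-side mirror of the same load-then-halve reduction. The following is only a sketch in plain C++: the function name `block_partial_sums` and the use of `std::int64_t` are assumptions made for illustration and are not part of the posted code. Like the kernel, it expects the block size to be a power of two.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side mirror of sumKernel: returns one partial sum per block.
// block_size is assumed to be a power of two, as in the kernel.
std::vector<std::int64_t> block_partial_sums(const std::vector<std::int64_t>& input,
                                             std::size_t block_size) {
    const std::size_t total_blocks = (input.size() + block_size - 1) / block_size;
    std::vector<std::int64_t> output(total_blocks, 0);
    std::vector<std::int64_t> data(block_size);  // stands in for the shared array

    for (std::size_t b = 0; b < total_blocks; ++b) {
        // Load phase: slots past the end of the input are padded with 0.
        for (std::size_t t = 0; t < block_size; ++t) {
            const std::size_t idx = b * block_size + t;
            data[t] = idx < input.size() ? input[idx] : 0;
        }
        // Tree reduction: add the upper half onto the lower half, then halve.
        for (std::size_t stride = block_size / 2; stride >= 1; stride >>= 1) {
            for (std::size_t t = 0; t < stride; ++t) {
                data[t] += data[t + stride];
            }
        }
        output[b] = data[0];
    }
    return output;
}
```

Comparing these partial sums with the `output` array copied back from the device can help show whether a wrong total comes from the kernel or from the final accumulation on the host.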

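In case this has something to do with my GPU device: my GPU is a GTX 650 Ti. Here is the information about the GPU:

Maximum number of threads per multiprocessor: 2048

Maximum number of threads per block: 1024

Maximum size of each dimension of a block: 1024 x 1024 x 64

Maximum size of each dimension of a grid: 2147483647 x 65535 x 65535

Maximum memory pitch: 2147483647 bytes

Texture alignment: 512 bytes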

score 2 · Accepted Answer

Actually, the answer couldn't fit in a long either, so the problem was solved once I used long double as the data type. Thanks, everyone!
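
The arithmetic behind this can be sketched on the host. The sum 1 + 2 + ... + 1,000,000 is 500,000,500,000, which does not fit in 32 bits. The example below assumes that `long` is 32 bits wide on the asker's platform (as it is with MSVC on Windows); the function names are hypothetical. Unsigned arithmetic is used for the 32-bit case so that the wraparound is well defined.

```cpp
#include <cstdint>

// Hypothetical helper: sum 1..n in a 64-bit integer (exact for n = 1,000,000).
std::int64_t sum_to_n_64(std::int64_t n) {
    std::int64_t total = 0;
    for (std::int64_t i = 1; i <= n; ++i) {
        total += i;
    }
    return total;
}

// Hypothetical helper: the same sum kept in 32 bits. Unsigned arithmetic wraps
// modulo 2^32, which imitates what a 32-bit accumulator would retain.
std::uint32_t sum_to_n_32(std::uint32_t n) {
    std::uint32_t total = 0;
    for (std::uint32_t i = 1; i <= n; ++i) {
        total += i;
    }
    return total;
}
```

With n = 1,000,000 the 64-bit version returns 500000500000, while the 32-bit version wraps to 1784293664. A 64-bit integer type such as `long long` is another option that keeps the sum exact.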

score 1 · Accepted Answer

One problem with your code is that the last cudaMemcpy is not set up correctly:

cudaMemcpy(output, dev_output, totalBlocks * sizeof(int), cudaMemcpyDeviceToHost);

All of your data is long, so you need to copy using sizeof(long), not sizeof(int).

Another problem with your code is that you are using the wrong printf format specifier for the long data type:

printf("\n %d \n",output[i]);

Use something like this instead:

printf("\n %ld \n",output[i]);
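
As a small illustration of matching the specifier to the type, the sketch below formats a `long` with `%ld` and a `long long` with `%lld`. The helper names are hypothetical and exist only for this example.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: format a long with the matching %ld specifier.
std::string format_long(long value) {
    char buffer[32];
    std::snprintf(buffer, sizeof buffer, "%ld", value);
    return std::string(buffer);
}

// Hypothetical helper: a long long needs %lld instead.
std::string format_long_long(long long value) {
    char buffer[32];
    std::snprintf(buffer, sizeof buffer, "%lld", value);
    return std::string(buffer);
}
```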

If you are not compiling for the sm_30 architecture, you may also run into problems with large block counts. In that case, proper CUDA error checking will identify the problem.
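
The block-count limit can be checked with a little arithmetic. Before compute capability 3.0 the x-dimension of a grid is limited to 65535 blocks; from 3.0 on it is 2147483647, which matches the device information in the question. The sketch below uses hypothetical names (`blocks_needed`, `fits_grid_x`) and does not call the CUDA runtime.

```cpp
#include <cstddef>

// Grid x-dimension limits (in blocks) before and from compute capability 3.0.
constexpr std::size_t kMaxGridXPreSm30 = 65535;
constexpr std::size_t kMaxGridXSm30 = 2147483647;

// Hypothetical helper: blocks needed to cover n elements, rounding up.
constexpr std::size_t blocks_needed(std::size_t n, std::size_t block_size) {
    return (n + block_size - 1) / block_size;
}

// Hypothetical helper: does a one-dimensional launch fit within the limit?
constexpr bool fits_grid_x(std::size_t blocks, std::size_t limit) {
    return blocks <= limit;
}
```

For example, 1,000,000 elements at 128 threads per block need 7813 blocks. With that block size the 65535 limit is first exceeded at 8,388,481 elements.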

score 0 · Accepted Answer

You do not check for errors after sumKernel<<<totalBlocks, BLOCK_SIZE>>>(dev_input, dev_output);. Normally, if you check the last error that occurred, it returns the error invalid configuration argument. Try adding the following after the sumKernel line:

cudaStatus = cudaGetLastError();
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "sumKernel failed: %s\n", cudaGetErrorString(cudaStatus));
    goto Error;
}

For more details on the error, see this question.
