cuda - CUDA thread block size of 1024 does not work (cc=20, sm=21)

Question

My setup:
- CUDA Toolkit 5.5
- NVidia Nsight Eclipse Edition
- Ubuntu 12.04 x64
- CUDA device: NVidia GeForce GTX 560, cc=20, sm=21 (as you can see, I should be able to use blocks of up to 1024 threads)

My display is rendered by the iGPU (Intel HD Graphics), so I can use the Nsight debugger.

However, I ran into strange behavior when I set the number of threads per block to more than 960.

Code:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void mytest() {
    float a, b;
    b = 1.0F;
    a = b / 1.0F;
}

int main(void) {

    // Error code to check return values for CUDA calls
    cudaError_t err = cudaSuccess;

    // Here I run my kernel
    mytest<<<1, 961>>>();

    err = cudaGetLastError();

    if (err != cudaSuccess) {
        fprintf(stderr, "error=%s\n", cudaGetErrorString(err));
        exit (EXIT_FAILURE);
    }

    // Reset the device and exit
    err = cudaDeviceReset();

    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to deinitialize the device! error=%s\n",
                cudaGetErrorString(err));
        exit (EXIT_FAILURE);
    }

    printf("Done\n");
    return 0;
}

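And... it doesn't work. The problem is in the last line of the kernel, the one with the float division. Whenever I try to divide by a float, the code compiles but does not run. The error printed at run time is:

error=too many resources requested for launch

When I step over the launch in the debugger, I get:

warning: Cuda API error detected: cudaLaunch returned (0x7)

Build output with -Xptxas -v: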

12:57:39 **** Incremental Build of configuration Debug for project block_size_test ****
make all 
Building file: ../src/vectorAdd.cu
Invoking: NVCC Compiler
/usr/local/cuda-5.5/bin/nvcc -I"/usr/local/cuda-5.5/samples/0_Simple" -I"/usr/local/cuda-5.5/samples/common/inc" -G -g -O0 -m64 -keep -keep-dir /home/vitrums/cuda-workspace-trashcan -optf /home/vitrums/cuda-workspace/block_size_test/options.txt -gencode arch=compute_20,code=sm_20 -gencode arch=compute_20,code=sm_21 -odir "src" -M -o "src/vectorAdd.d" "../src/vectorAdd.cu"
/usr/local/cuda-5.5/bin/nvcc --compile -G -I"/usr/local/cuda-5.5/samples/0_Simple" -I"/usr/local/cuda-5.5/samples/common/inc" -O0 -g -gencode arch=compute_20,code=compute_20 -gencode arch=compute_20,code=sm_21 -keep -keep-dir /home/vitrums/cuda-workspace-trashcan -m64 -optf /home/vitrums/cuda-workspace/block_size_test/options.txt  -x cu -o  "src/vectorAdd.o" "../src/vectorAdd.cu"
../src/vectorAdd.cu(7): warning: variable "a" was set but never used

../src/vectorAdd.cu(7): warning: variable "a" was set but never used

ptxas info    : 4 bytes gmem, 8 bytes cmem[14]
ptxas info    : Function properties for _ZN4dim3C1Ejjj
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Compiling entry function '_Z6mytestv' for 'sm_21'
ptxas info    : Function properties for _Z6mytestv
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 34 registers, 8 bytes cumulative stack size, 32 bytes cmem[0]
ptxas info    : Function properties for _ZN4dim3C2Ejjj
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
Finished building: ../src/vectorAdd.cu

Building target: block_size_test
Invoking: NVCC Linker
/usr/local/cuda-5.5/bin/nvcc --cudart static -m64 -link -o  "block_size_test"  ./src/vectorAdd.o   
Finished building target: block_size_test


12:57:41 Build Finished (took 1s.659ms)

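If I add the -keep option, the compiler generates a .cubin file, but I can't read it to find the smem and reg values, as described in find-out-what-resources-/ . At least nowadays this file must be in a different format.

So, judging by this .xls, CUDA_Occupancy_calculator, I'm forced to use 256 threads per block, which is probably not a bad idea anyway.

Anyway, any help will be appreciated.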

score 5 · Accepted Answer

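I filled in the CUDA Occupancy Calculator file with your current values:

Compute capability: 2.1
Threads per block: 961
Registers per thread: 34
Shared memory: 0

This gives 0% occupancy, limited by the register count.
If you set the number of threads to 960, occupancy is 63%, which explains why that configuration works.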
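The register limit is not obvious from a plain multiplication, since 34 × 961 = 32674 is still below the 32768 registers of a cc 2.x multiprocessor. The minimal sketch below reproduces the calculator's rounding in plain C. It assumes the Fermi allocation rules as I understand them from the Occupancy Calculator (registers allocated per warp in units of 64, warps allocated in pairs); the constants and the function names `round_up`, `regs_per_block` and `block_fits` are illustrative, not part of any CUDA API.

```c
#include <stdbool.h>

/* Assumed limits for compute capability 2.x (Fermi), as listed in the
   CUDA Occupancy Calculator. Verify them before reusing this elsewhere. */
enum {
    WARP_SIZE              = 32,
    REGS_PER_SM            = 32768, /* 32-bit registers per multiprocessor */
    REG_ALLOC_UNIT         = 64,    /* assumed per-warp allocation unit */
    WARP_ALLOC_GRANULARITY = 2      /* assumed: warps are allocated in pairs */
};

/* Round x up to the next multiple of m. */
int round_up(int x, int m) {
    return (x + m - 1) / m * m;
}

/* Registers reserved by one block under the assumed allocation rules. */
int regs_per_block(int threads, int regs_per_thread) {
    /* Whole warps needed, then rounded up to the warp granularity. */
    int warps = round_up(round_up(threads, WARP_SIZE) / WARP_SIZE,
                         WARP_ALLOC_GRANULARITY);
    /* Registers for one warp, rounded up to the allocation unit. */
    int regs_per_warp = round_up(regs_per_thread * WARP_SIZE, REG_ALLOC_UNIT);
    return warps * regs_per_warp;
}

/* A block can launch only if its registers fit on one multiprocessor. */
bool block_fits(int threads, int regs_per_thread) {
    return regs_per_block(threads, regs_per_thread) <= REGS_PER_SM;
}
```

Under these assumptions, `regs_per_block(961, 34)` is 34816 (32 warps × 1088 registers), which exceeds 32768, while `regs_per_block(960, 34)` is 32640 (30 warps × 1088) and fits. With 30 of the 48 warps a cc 2.x multiprocessor can hold, that is 62.5% occupancy, consistent with the 63% reported above. At 32 registers per thread, a full 1024-thread block needs exactly 32768.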

Try limiting the register count to 32 and setting the number of threads to 1024; that gives 67% occupancy.

To limit the register count, use the following option:
nvcc [...] --maxrregcount=32
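As a possible alternative to the file-wide compiler option, the register budget can be requested per kernel with `__launch_bounds__`, and the resulting limits can be read back at run time with `cudaFuncGetAttributes`, which also avoids having to parse the .cubin file. The following is only a sketch: the kernel name `mytest_bounded` is made up for illustration, and I have not checked how a debug build (-G) on this particular setup responds to the bound.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical variant of the kernel from the question.
// __launch_bounds__(1024) asks the compiler to keep register usage low
// enough for this kernel to be launched with 1024 threads per block.
__global__ void __launch_bounds__(1024) mytest_bounded() {
    float a, b;
    b = 1.0F;
    a = b / 1.0F;
}

int main(void) {
    // Ask the runtime what the compiled kernel actually needs.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, mytest_bounded);
    if (err != cudaSuccess) {
        fprintf(stderr, "error=%s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    printf("registers per thread: %d\n", attr.numRegs);
    printf("max threads per block for this kernel: %d\n",
           attr.maxThreadsPerBlock);

    // Refuse a block size the kernel cannot be launched with.
    int threads = 1024;
    if (threads > attr.maxThreadsPerBlock) {
        fprintf(stderr, "block of %d threads exceeds the kernel limit\n",
                threads);
        exit(EXIT_FAILURE);
    }

    mytest_bounded<<<1, threads>>>();

    // Launch configuration errors are reported here.
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "error=%s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Wait for the kernel so execution errors are not missed.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "error=%s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    cudaDeviceReset();
    printf("Done\n");
    return 0;
}
```

Checking `attr.maxThreadsPerBlock` before the launch turns the "too many resources requested for launch" failure into an explicit, earlier diagnostic.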
