c - Why is this statement slow in a CUDA kernel?

Question

I am doing some computer vision work using CUDA. The following code takes about 20 seconds to complete.

__global__ void nlmcuda_kernel(float* fpOMul, /* other input args */) {

    float fpODenoised[75];

    /* Do awesome stuff to compute fpODenoised */

    // Inside nested loops (this statement is the bottleneck in the code):
    fpOMul[ii * iwl * iwxh + iindex * iwxh + il] = fpODenoised[ii * iwl + iindex];

}

If I replace that statement with

fpOMul[ii * iwl * iwxh + iindex * iwxh + il] = 2.0f;

the code completes in a few seconds.

Why is the indicated statement slow, and how can I make it run faster?

score 3 · Accepted Answer

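When you make that change, the compiler can see that the code computing fpODenoised is no longer needed, and it can optimize that code away. The statement you changed is not itself the direct cause of the performance difference. You can confirm this by examining the PTX or SASS code for each version.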
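To make the effect concrete, here is a minimal sketch, not the asker's actual kernel. The names `expensive_fill`, `store_computed`, `store_constant`, and `run_variant` are hypothetical, and the arithmetic loop is only a stand-in for the elided computation. The two kernels differ only in what they store, which is the same difference as in the question.

```cuda
#include <cassert>
#include <cuda_runtime.h>

constexpr int kLocal = 75;  // same size as fpODenoised in the question

// Hypothetical stand-in for the elided computation: deliberately
// expensive arithmetic that fills a per-thread local array.
__device__ void expensive_fill(float* local, float seed) {
    for (int i = 0; i < kLocal; ++i) {
        float acc = seed + i;
        for (int k = 0; k < 1000; ++k) acc = acc * 1.0001f + 0.5f;
        local[i] = acc;
    }
}

// Variant A: the computed values reach global memory, so the compiler
// has to keep the expensive work.
__global__ void store_computed(float* out, float seed) {
    float local[kLocal];
    expensive_fill(local, seed);
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * kLocal;
    for (int i = 0; i < kLocal; ++i) out[base + i] = local[i];
}

// Variant B: only a constant is stored. local[] is never read, so the
// compiler is free to remove expensive_fill as dead code.
__global__ void store_constant(float* out, float seed) {
    float local[kLocal];
    expensive_fill(local, seed);
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * kLocal;
    for (int i = 0; i < kLocal; ++i) out[base + i] = 2.0f;
}

// Hypothetical host helper: runs one variant on a single block of
// 32 threads and returns the first stored value.
float run_variant(bool constant_store, float seed) {
    const int threads = 32;
    const size_t bytes = threads * kLocal * sizeof(float);
    float* d_out = nullptr;
    cudaError_t err = cudaMalloc(&d_out, bytes);
    assert(err == cudaSuccess);
    if (constant_store) store_constant<<<1, threads>>>(d_out, seed);
    else                store_computed<<<1, threads>>>(d_out, seed);
    float first = 0.0f;
    // cudaMemcpy waits for the kernel to finish before copying.
    err = cudaMemcpy(&first, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    cudaFree(d_out);
    return first;
}
```

Compiling this with `nvcc -ptx demo.cu` (the file name is a placeholder) and comparing the two kernel bodies, or running `cuobjdump -sass` on a compiled binary, should show the arithmetic loop in `store_computed` but not in `store_constant`. That would mean the fast version is fast because the work was dropped, not because the store itself got cheaper.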
