performance - CUDA stream is slower than usual kernel

Question

I am trying to understand CUDA streams. I have written my first program that uses a stream, but it is slower than the usual kernel launch without streams.

Why is this code slower:

cudaMemcpyAsync(pole_dev, pole, size, cudaMemcpyHostToDevice, stream_1);    
addKernel<<<count/100, 100, 0, stream_1>>>(pole_dev);
cudaMemcpyAsync(pole, pole_dev, size, cudaMemcpyDeviceToHost, stream_1);
cudaThreadSynchronize();  // I don't know difference between cudaThreadSync and cudaDeviceSync
cudaDeviceSynchronize();  // it acts relatively same...

than this code:

cudaMemcpy(pole_dev, pole, size, cudaMemcpyHostToDevice);
addKernel<<<count/100, 100>>>(pole_dev);
cudaMemcpy(pole, pole_dev, size, cudaMemcpyDeviceToHost);

I thought that it should run faster. The value of the variable count is 6 500 000 (the maximum). The first version takes 14 milliseconds and the second takes 11 milliseconds.

Can anybody explain it to me, please?

score 2 · Accepted Answer

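In this snippet you are working with only a single stream (stream_1), which is actually what CUDA does for you automatically when you don't manipulate streams explicitly. Operations issued to the same stream execute in order, so nothing overlaps here and the stream version has nothing to gain.

To take advantage of streams and asynchronous memory transfers, you need to use several streams and split your data and computations across them.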
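As a rough illustration of that splitting, here is a minimal sketch. It assumes the array holds int values and that addKernel simply adds 1 to each element; the question shows neither. The names addKernelChunk and addInChunks, the bounds-checked kernel signature and the block size of 128 are hypothetical choices made for this example, not part of the original code.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the question's addKernel (assumed to add 1 per element).
__global__ void addKernelChunk(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // guard: the last block may be only partly used
        data[i] += 1;
}

// Splits `count` ints across `nStreams` streams so that the copy of one chunk
// can overlap with the kernel or copy of another chunk.
// `pole` should be pinned host memory (cudaMallocHost); with pageable memory
// the async copies generally cannot overlap.
// Returns 0 on success, 1 if a CUDA call reported an error.
int addInChunks(int *pole, int count, int nStreams)
{
    const int threads = 128;                          // assumed block size
    const int chunk = (count + nStreams - 1) / nStreams;  // elements per stream, rounded up

    int *pole_dev = nullptr;
    if (cudaMalloc((void **)&pole_dev, count * sizeof(int)) != cudaSuccess)
        return 1;

    cudaStream_t *streams = new cudaStream_t[nStreams];
    for (int s = 0; s < nStreams; ++s)
        cudaStreamCreate(&streams[s]);

    // Issue copy-in, kernel, copy-out for each chunk in its own stream.
    for (int s = 0; s < nStreams; ++s) {
        int offset = s * chunk;
        int n = (count - offset < chunk) ? count - offset : chunk;
        if (n <= 0) break;                            // no elements left for this stream
        size_t bytes = n * sizeof(int);
        cudaMemcpyAsync(pole_dev + offset, pole + offset, bytes,
                        cudaMemcpyHostToDevice, streams[s]);
        addKernelChunk<<<(n + threads - 1) / threads, threads, 0, streams[s]>>>(
            pole_dev + offset, n);
        cudaMemcpyAsync(pole + offset, pole_dev + offset, bytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    // Wait for all streams; errors from the async work surface here.
    cudaError_t err = cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s)
        cudaStreamDestroy(streams[s]);
    delete[] streams;
    cudaFree(pole_dev);
    return err == cudaSuccess ? 0 : 1;
}
```

To try it, allocate pole with cudaMallocHost, fill it, and call something like addInChunks(pole, count, 4). Whether this beats the plain cudaMemcpy version depends on the device (for example, whether it can copy and compute at the same time) and on how long the kernel runs relative to the transfers, so it is worth timing both.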
