cuda - How to understand "Implicit Synchronization" for concurrent kernel

Question

In the NVIDIA CUDA C Programming Guide 4.0, section 3.2.5.5.4, it says that two commands from different streams cannot run concurrently if a device-to-device memory copy is issued in between them. I am not sure exactly what that means, and I hope someone can clarify.

Let's say my program has two streams, stream 0 and stream 1. The kernels are launched into these streams in the following order.

kernel 0.0 (stream 0; assume the execution time is 10 ms)

kernel 1.0 (stream 1; assume the execution time is 1 ms)

kernel 1.1 (stream 1; assume the execution time is 3 ms)

kernel 1.2 (stream 1; this kernel causes a device-to-device memory copy, assume the execution time is 1 ms)

kernel 1.3 (stream 1; assume the execution time is 6 ms)

Let's also assume the program has no other overhead and the GPU has enough SMs to run these kernels concurrently. My question is whether kernel 0.0 can run concurrently with kernel 1.2 and kernel 1.3. What is the running time of the whole program?

score 1 · Accepted Answer

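As you mention, device-to-device memory copies are performed from the host using cudaMemcpy(). Kernels are free to read and write global memory at will. If your kernels are in different streams they may overlap, although there is no guarantee. The exact speedup depends on how much of the SMs each kernel uses. NVIDIA recommends timing kernel execution with events (start and stop timers) to determine whether the overlapped version is faster than the sequential one. You can compare that result against either switching the kernels to stream 0 or running the app under the profiler, which serializes kernel execution.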
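A minimal sketch of that event-timing comparison, under some assumptions: busyKernel, timeTwoKernels, the CHECK macro and the cycle count are hypothetical placeholders rather than anything from the question, and the program is assumed to be built with nvcc's default (legacy) default-stream behaviour, so that events recorded in stream 0 bracket the work in the other streams. It launches the same two kernels once in two separate streams and once in stream 0, and prints both timings.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s (line %d)\n",             \
                    cudaGetErrorString(err_), __LINE__);              \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Placeholder kernel (an assumption, not from the question):
// spins until roughly `cycles` device clock ticks have elapsed.
__global__ void busyKernel(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { /* spin */ }
}

// Launches one kernel in stream `a` and one in stream `b`, and returns the
// elapsed time in milliseconds measured with events recorded in stream 0.
// Assumes the legacy default stream, which synchronizes with blocking streams.
float timeTwoKernels(cudaStream_t a, cudaStream_t b, long long cycles)
{
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    CHECK(cudaDeviceSynchronize());           // start from an idle device
    CHECK(cudaEventRecord(start, 0));
    busyKernel<<<1, 1, 0, a>>>(cycles);
    busyKernel<<<1, 1, 0, b>>>(cycles);
    CHECK(cudaGetLastError());                // catch launch errors
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));        // wait until both kernels finish

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return ms;
}

int main()
{
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    printf("concurrentKernels supported: %d\n", prop.concurrentKernels);

    cudaStream_t s0, s1;
    CHECK(cudaStreamCreate(&s0));
    CHECK(cudaStreamCreate(&s1));

    const long long cycles = 10000000LL;      // arbitrary spin length
    timeTwoKernels(s0, s1, cycles);           // warm-up run, result discarded

    float overlapped = timeTwoKernels(s0, s1, cycles);  // two streams
    float serialized = timeTwoKernels(0, 0, cycles);    // both in stream 0

    printf("two streams: %.3f ms\n", overlapped);
    printf("stream 0:    %.3f ms\n", serialized);

    CHECK(cudaStreamDestroy(s0));
    CHECK(cudaStreamDestroy(s1));
    return 0;
}
```

If the two-stream time is close to the stream 0 time, the kernels did not overlap on that device. For the numbers in the question, running everything back to back would take 10 + 1 + 3 + 1 + 6 = 21 ms, while perfect overlap between the two streams would be bounded below by max(10, 1 + 3 + 1 + 6) = 11 ms; a measured result could fall anywhere in that range.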
