The NVIDIA CUDA C Programming Guide 4.0, section 3.2.5.5.4, says that two commands from different streams cannot run concurrently if a device-to-device memory copy is issued in between them. I am not sure exactly what that means, and I hope someone can clear up my confusion.
Let's say my program has two streams, stream 0 and stream 1. The kernels are launched into these streams in the following order:
kernel 0.0 (stream 0; assume the execution time is 10 ms)
kernel 1.0 (stream 1; assume the execution time is 1 ms)
kernel 1.1 (stream 1; assume the execution time is 3 ms)
kernel 1.2 (stream 1; this kernel causes a device-to-device memory copy, assume the execution time is 1 ms)
kernel 1.3 (stream 1; assume the execution time is 6 ms)
Let's also assume the program has no other overhead and the GPU has enough SMs to run these kernels concurrently. My question is: can kernel 0.0 run concurrently with kernel 1.2 and kernel 1.3? And what is the total running time of the whole program?
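To make the scenario concrete, here is a minimal sketch of the launch order described above. It rests on several assumptions that are mine and not from the guide: `busy_kernel` is a hypothetical placeholder that simply spins for a given number of clock cycles, both streams are created with `cudaStreamCreate` (neither is the default stream), and the device-to-device copy attributed to kernel 1.2 is modeled as an explicit `cudaMemcpyAsync` with `cudaMemcpyDeviceToDevice` issued into stream 1. The buffer size is arbitrary, so the copy will not take exactly 1 ms.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel: spins for roughly `cycles` SM clock ticks.
__global__ void busy_kernel(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    // Peak clock rate is reported in kHz, i.e. cycles per millisecond.
    int khz = 0;
    cudaDeviceGetAttribute(&khz, cudaDevAttrClockRate, 0);
    const long long ms = khz;

    // Assumption: "stream 0" and "stream 1" are both non-default streams.
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Buffers for the device-to-device copy (size chosen arbitrarily).
    const size_t bytes = (1 << 20) * sizeof(float);
    float *src = nullptr, *dst = nullptr;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);

    cudaDeviceSynchronize();
    auto t0 = std::chrono::steady_clock::now();

    busy_kernel<<<1, 1, 0, s0>>>(10 * ms);  // kernel 0.0, ~10 ms
    busy_kernel<<<1, 1, 0, s1>>>(1 * ms);   // kernel 1.0, ~1 ms
    busy_kernel<<<1, 1, 0, s1>>>(3 * ms);   // kernel 1.1, ~3 ms
    // Stand-in for "kernel 1.2": the device-to-device copy in stream 1.
    cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDeviceToDevice, s1);
    busy_kernel<<<1, 1, 0, s1>>>(6 * ms);   // kernel 1.3, ~6 ms

    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();

    // Wall-clock time for the whole schedule, including launch overhead.
    double elapsed =
        std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("total: %.2f ms\n", elapsed);

    cudaFree(src);
    cudaFree(dst);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    return 0;
}
```

If everything in stream 1 overlapped with kernel 0.0, I would expect the printed total to be close to the longer of the two streams; if the copy forces some serialization, it should be noticeably higher. Error checking is omitted to keep the sketch short.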