cuda - How do I define the destination device stream in cudaMemcpyPeerAsync()?

Question

I'm using cudaMemcpyPeerAsync() to do an asynchronous memcpy from gpu0 to gpu1.

cudaMemcpyAsync() provides an option to choose the stream used on gpu0, but not on gpu1. Can I somehow define the stream for the receiving device as well?

I'm using OpenMP threads to manage each device (so they are in separate contexts).

Visual Profiler shows the stream for the sending device, but for the receiving device this memcpy only appears under MemCpy (PtoP) and not in any stream (not even the default stream).

PS: My current implementation works fine. I just want to overlap the sending and receiving communications.

score 1 · Accepted Answer

There is no API call for a CUDA peer copy that lets you specify streams on both ends, so the simple answer to your question is no.

Streams are a way of organizing activity. The cudaMemcpyPeerAsync call will show up in the stream (and on the device) to which it is assigned. That is the level of control the API gives you.
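As a minimal sketch of what "assigned to a stream" looks like in practice, the following issues a peer copy from device 0 to device 1 in a stream created on device 0. The function name `peer_copy_in_stream`, the buffer names, and the choice of device ordinals 0 and 1 are assumptions made for illustration; the stream is created on the sending device because that matches the setup described in the question.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: copies `bytes` from device 0 to device 1 in a stream
// owned by device 0. Returns true on success, or when fewer than two GPUs
// are present (in which case there is nothing to demonstrate).
bool peer_copy_in_stream(size_t bytes) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count < 2) return true;

    void *src = nullptr, *dst = nullptr;
    cudaStream_t s0 = nullptr;
    bool ok = true;

    // Source buffer and the stream live on device 0.
    ok &= cudaSetDevice(0) == cudaSuccess;
    ok &= cudaMalloc(&src, bytes) == cudaSuccess;
    ok &= cudaStreamCreate(&s0) == cudaSuccess;

    // Destination buffer lives on device 1.
    ok &= cudaSetDevice(1) == cudaSuccess;
    ok &= cudaMalloc(&dst, bytes) == cudaSuccess;

    // The call takes exactly one stream argument:
    // cudaMemcpyPeerAsync(dst, dstDevice, src, srcDevice, count, stream)
    ok &= cudaSetDevice(0) == cudaSuccess;
    ok &= cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, s0) == cudaSuccess;
    ok &= cudaStreamSynchronize(s0) == cudaSuccess;

    // Release resources on the device that owns them.
    cudaSetDevice(0);
    cudaFree(src);
    if (s0) cudaStreamDestroy(s0);
    cudaSetDevice(1);
    cudaFree(dst);
    return ok;
}
```

Calling `peer_copy_in_stream(1 << 20)` from `main` and profiling the run should show the copy in the device 0 stream only, which is the behavior the question describes.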

Streams dictate (i.e. control, regulate) behavior, and assigning a single CUDA task to more than one stream (here, streams on more than one device) is a level of control that CUDA does not expose. Devices (and streams) are intended to operate asynchronously. Requiring a particular CUDA task to satisfy the ordering requirements of two separate streams on two separate devices would introduce a type of synchronization that is not appropriate, and could lead to various kinds of activity stalls and perhaps even deadlock.
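If the underlying goal is for work on the receiving device to be ordered after the copy, one commonly used alternative is an explicit event dependency rather than a second stream argument. The sketch below is an illustration under assumptions, not something the question's code is known to do: the function name `copy_then_consume_on_peer`, the fill value `0x5A`, and the device ordinals are all hypothetical. An event is recorded in the sending stream after the copy, and a stream on device 1 waits on that event before reading the data back.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: device 0 fills and sends a buffer in stream s0; a
// stream s1 on device 1 waits on an event before reading the received data.
// Returns true on success, or when fewer than two GPUs are present.
bool copy_then_consume_on_peer(size_t bytes) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count < 2) return true;

    void *src = nullptr, *dst = nullptr;
    cudaStream_t s0 = nullptr, s1 = nullptr;
    cudaEvent_t copied = nullptr;
    std::vector<unsigned char> host(bytes, 0);
    bool ok = true;

    ok &= cudaSetDevice(0) == cudaSuccess;
    ok &= cudaMalloc(&src, bytes) == cudaSuccess;
    ok &= cudaStreamCreate(&s0) == cudaSuccess;
    ok &= cudaEventCreateWithFlags(&copied, cudaEventDisableTiming) == cudaSuccess;

    ok &= cudaSetDevice(1) == cudaSuccess;
    ok &= cudaMalloc(&dst, bytes) == cudaSuccess;
    ok &= cudaStreamCreate(&s1) == cudaSuccess;

    // Device 0: fill the source, send it, then mark the end of the copy in s0.
    ok &= cudaSetDevice(0) == cudaSuccess;
    ok &= cudaMemsetAsync(src, 0x5A, bytes, s0) == cudaSuccess;
    ok &= cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, s0) == cudaSuccess;
    ok &= cudaEventRecord(copied, s0) == cudaSuccess;

    // Device 1: work queued in s1 after this wait runs only once the copy is done.
    ok &= cudaSetDevice(1) == cudaSuccess;
    ok &= cudaStreamWaitEvent(s1, copied, 0) == cudaSuccess;
    ok &= cudaMemcpyAsync(host.data(), dst, bytes, cudaMemcpyDeviceToHost, s1) == cudaSuccess;
    ok &= cudaStreamSynchronize(s1) == cudaSuccess;

    // Every byte read back on the host should carry the fill value.
    for (unsigned char b : host) ok &= (b == 0x5A);

    cudaSetDevice(0);
    cudaFree(src);
    if (s0) cudaStreamDestroy(s0);
    if (copied) cudaEventDestroy(copied);
    cudaSetDevice(1);
    cudaFree(dst);
    if (s1) cudaStreamDestroy(s1);
    return ok;
}
```

The copy itself still belongs to a single stream; only the dependent work on device 1 is ordered against it, so neither device's other streams are forced to stall.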

Nothing described here, nor the behavior of cudaMemcpyPeerAsync, should prevent you from overlapping copy operations in various directions. In fact, in my opinion, assigning a CUDA task to more than one stream would make flexible overlap more difficult to achieve.
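A rough sketch of how copies in both directions might be issued is shown here, with each device enqueuing only its outbound copy in its own stream. The function name `bidirectional_peer_copy` and the buffer layout are assumptions for illustration, and a single host thread is used for brevity where the question uses one OpenMP thread per device. Whether the two copies actually overlap depends on the hardware and system topology, so the profiler remains the way to confirm it.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: device 0 sends to device 1 while device 1 sends to
// device 0, each copy in a stream owned by its sending device.
// Returns true on success, or when fewer than two GPUs are present.
bool bidirectional_peer_copy(size_t bytes) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count < 2) return true;

    void *send[2] = {nullptr, nullptr};
    void *recv[2] = {nullptr, nullptr};
    cudaStream_t out[2] = {nullptr, nullptr};
    bool ok = true;

    // Per-device send buffer, receive buffer, and outbound stream.
    for (int d = 0; d < 2; ++d) {
        ok &= cudaSetDevice(d) == cudaSuccess;
        ok &= cudaMalloc(&send[d], bytes) == cudaSuccess;
        ok &= cudaMalloc(&recv[d], bytes) == cudaSuccess;
        ok &= cudaStreamCreate(&out[d]) == cudaSuccess;
    }

    // Enqueue both outbound copies before waiting on either of them.
    for (int d = 0; d < 2; ++d) {
        int peer = 1 - d;
        ok &= cudaSetDevice(d) == cudaSuccess;
        ok &= cudaMemcpyPeerAsync(recv[peer], peer, send[d], d, bytes, out[d]) == cudaSuccess;
    }

    // Wait for both directions to finish.
    for (int d = 0; d < 2; ++d) {
        ok &= cudaSetDevice(d) == cudaSuccess;
        ok &= cudaStreamSynchronize(out[d]) == cudaSuccess;
    }

    for (int d = 0; d < 2; ++d) {
        cudaSetDevice(d);
        cudaFree(send[d]);
        cudaFree(recv[d]);
        if (out[d]) cudaStreamDestroy(out[d]);
    }
    return ok;
}
```

Because neither copy waits on a stream of the receiving device, the two directions have no ordering dependency on each other and are free to run concurrently if the hardware allows it.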

If you have difficulty achieving a particular overlap, you should probably describe the problem: provide a simple, complete, compilable reproducer (an SSCCE, see SSCCE.org), show the overlap that Visual Profiler currently reports, and describe the overlap you want.
