concurrency - MPI + CUDA-aware, concurrent kernels, and MPI_Sendrecv

Question

I ran into a small problem at work. I am currently using MVAPICH-GDR-2.05 and Open MPI 1.7.4 with CUDA 6.0.

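I am working on exchanging non-contiguous elements (such as the columns of a matrix) between GPUs with MPI_Sendrecv, and I am trying to run two kernels (one for the scatter, one for the gather) concurrently with the communication between the two GPUs.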
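As an illustration only (this is not the code from the question), the gather and scatter steps for one matrix column could look like the minimal sketch below. It assumes a row-major matrix of doubles. The names `gather_column`, `scatter_column`, and `roundtrip_column` are hypothetical. The gather packs a strided column into a contiguous buffer that could be passed to a CUDA-aware MPI_Sendrecv, and the scatter unpacks such a buffer into a column.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical gather kernel: packs column `col` of a row-major
// rows x cols matrix into a contiguous buffer of length `rows`.
__global__ void gather_column(const double *matrix, double *packed,
                              int rows, int cols, int col)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < rows)
        packed[r] = matrix[r * cols + col];
}

// Hypothetical scatter kernel: unpacks a contiguous buffer of length
// `rows` into column `col` of a row-major rows x cols matrix.
__global__ void scatter_column(const double *packed, double *matrix,
                               int rows, int cols, int col)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < rows)
        matrix[r * cols + col] = packed[r];
}

// Packs column src_col, unpacks it into column dst_col on the same GPU,
// and returns true if both columns match afterwards.
bool roundtrip_column(int rows, int cols, int src_col, int dst_col)
{
    std::vector<double> host(static_cast<size_t>(rows) * cols);
    for (size_t i = 0; i < host.size(); ++i)
        host[i] = static_cast<double>(i);

    double *d_matrix = nullptr, *d_packed = nullptr;
    size_t matrix_bytes = host.size() * sizeof(double);
    if (cudaMalloc(&d_matrix, matrix_bytes) != cudaSuccess)
        return false;
    if (cudaMalloc(&d_packed, rows * sizeof(double)) != cudaSuccess) {
        cudaFree(d_matrix);
        return false;
    }
    cudaMemcpy(d_matrix, host.data(), matrix_bytes, cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (rows + threads - 1) / threads;
    // Both kernels go to the default stream, so they run in launch order.
    gather_column<<<blocks, threads>>>(d_matrix, d_packed, rows, cols, src_col);
    scatter_column<<<blocks, threads>>>(d_packed, d_matrix, rows, cols, dst_col);
    bool ok = (cudaGetLastError() == cudaSuccess);

    // Blocking copy: waits for the kernels above to finish.
    ok = ok && (cudaMemcpy(host.data(), d_matrix, matrix_bytes,
                           cudaMemcpyDeviceToHost) == cudaSuccess);
    cudaFree(d_packed);
    cudaFree(d_matrix);

    for (int r = 0; ok && r < rows; ++r)
        ok = (host[r * cols + dst_col] == host[r * cols + src_col]);
    return ok;
}

int main()
{
    bool ok = roundtrip_column(1000, 8, 1, 5);
    std::printf("column round trip %s\n", ok ? "matched" : "failed");
    return ok ? 0 : 1;
}
```

In the question's setup the packed buffer would be sent to the other GPU between the two kernels. Here both kernels run on one device only to keep the sketch self-contained.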

I used the CUDA profiler (nvprof) to see what my program is doing, and I noticed something strange.

With Open MPI 1.7.4, three CUDA streams are active concurrently.
With MVAPICH-gdr-2.05, the two kernels run concurrently, but MPI_Sendrecv does not overlap with them.

Do you know why MPI_Sendrecv behaves this way with MVAPICH?

This is my pseudocode:

// creation and initialization of streams
// (cudaStreamCreateWithFlags takes a pointer to the stream handle)
cudaStream_t stream1, stream2;
cudaStreamCreateWithFlags( &stream1, cudaStreamNonBlocking );
cudaStreamCreateWithFlags( &stream2, cudaStreamNonBlocking );

///////////////////////////////////////////////////////////////////

// 1) --> gather of the first chunk
gather_kernel <<< dimGrid, dimBlock, 0, stream1 >>> ( ... );
cudaStreamSynchronize(stream1);

// 2) --> gather of the second chunk
//    --> communication of the first chunk
gather_kernel <<< dimGrid, dimBlock, 0, stream1 >>> ( ... );
MPI_Sendrecv( ... );
cudaStreamSynchronize(stream1);

// 3) --> scatter of chunk (ii)
//    --> gather of chunk (ii+2)
//    --> communication of chunk (ii+1)
// K is the number of chunks
for ( int ii = 0; ii < K-2; ii++ ) {
    scatter_kernel <<< dimGrid, dimBlock, 0, stream2 >>> ( ... );
    gather_kernel  <<< dimGrid, dimBlock, 0, stream1 >>> ( ... );
    MPI_Sendrecv( ... );
    cudaStreamSynchronize(stream2);
    cudaStreamSynchronize(stream1);
}

// 4) --> scatter of the penultimate chunk
//    --> communication of the last chunk
scatter_kernel <<< dimGrid, dimBlock, 0, stream2 >>> ( ... );
MPI_Sendrecv( ... );
cudaStreamSynchronize(stream2);

// 5) --> scatter of the last chunk
scatter_kernel <<< dimGrid, dimBlock, 0, stream2 >>> ( ... );
cudaStreamSynchronize(stream2);

And these are the screenshots from the two profiler runs.
