c++ - CUDA device-to-host copy is very slow

Question

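I am running Windows 7 64-bit, CUDA 4.2, and Visual Studio 2010.

First, I run some code on CUDA and download the data to the host. Then I do some processing and go back to the device. Then I perform the following copy from device to host. It runs very fast, in about 1 ms.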

clock_t start, end;
int count = 1000000;
thrust::host_vector <int> h_a(count);
thrust::device_vector <int> d_b(count,0);
int *d_bPtr = thrust::raw_pointer_cast(&d_b[0]);
start=clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
end=clock();
cout<<"Time Spent:"<<end-start<<endl;

It takes about 1 ms to complete.

Then I ran some other code on CUDA again, mainly atomic operations. After that, I copy the data from device to host, and it takes a very long time, about 9 seconds.

__global__ void dosomething(int *d_bPtr)
{
....
atomicExch(d_bPtr,c);
....
}

start=clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
end=clock();
cout<<"Time Spent:"<<end-start<<endl;

~9 seconds

I ran the code several times, for example:

int i=0;
while (i<10)
{
clock_t start, end;
int count = 1000000;
thrust::host_vector <int> h_a(count);
thrust::device_vector <int> d_b(count,0);
int *d_bPtr = thrust::raw_pointer_cast(&d_b[0]);
start=clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
end=clock();
cout<<"Time Spent:"<<end-start<<endl;

__global__ void dosomething(int *d_bPtr)
{
....
atomicExch(d_bPtr,c);
....
}

start=clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
end=clock();
cout<<"Time Spent:"<<end-start<<endl;
i++;
}

The results are almost the same.
What could be the problem?

Thank you!

score 10 · Accepted Answer

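The problem is one of timing, not of any change in copy performance. Kernel launches are asynchronous in CUDA, so what you are measuring is not just the time for thrust::copy but also the time for the previously launched kernel to finish. Try changing the code that times the copy operation to something like this: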

cudaDeviceSynchronize(); // wait until prior kernel is finished
start=clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
end=clock();
cout<<"Time Spent:"<<end-start<<endl;

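You should find that the transfer times are restored to their previous performance. So your real question isn't "why is thrust::copy slow", it is "why is my kernel slow". And based on the rather terrible pseudocode you posted, the answer is "because it is full of atomicExch() calls, which serialize the kernel's memory transactions".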
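To make the timing effect concrete, here is a minimal host-only C++ sketch of the same situation. It is an analogy, not CUDA code: launch_slow_work and timed_copy_ms are hypothetical names introduced for illustration, std::async stands in for the asynchronous kernel launch, and a plain vector assignment stands in for thrust::copy. It assumes only that the copy has to wait for earlier work before it can proceed, as described above.

```cpp
#include <chrono>
#include <future>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;
using Ms = std::chrono::milliseconds;

// Hypothetical stand-in for an asynchronous kernel launch: the work runs on
// another thread and control returns to the caller right away.
std::future<void> launch_slow_work(Ms duration) {
    return std::async(std::launch::async,
                      [duration] { std::this_thread::sleep_for(duration); });
}

// Hypothetical stand-in for a blocking device-to-host copy: it cannot start
// until the pending work has finished.
// If sync_before_timing is true, we wait *before* starting the clock, which
// plays the role of cudaDeviceSynchronize() in the answer.
long long timed_copy_ms(std::future<void>& pending,
                        const std::vector<int>& src,
                        std::vector<int>& dst,
                        bool sync_before_timing) {
    if (sync_before_timing) {
        pending.wait();
    }
    const auto start = Clock::now();
    pending.wait();  // implicit wait inside the "copy"; returns at once if already done
    dst = src;       // the actual copy
    const auto end = Clock::now();
    return std::chrono::duration_cast<Ms>(end - start).count();
}
```

Calling timed_copy_ms with sync_before_timing set to false right after launch_slow_work(Ms(300)) should report roughly the full 300 ms, even though the copy itself is trivial. With sync_before_timing set to true, the reported time should drop to about the cost of the copy alone. The total wall-clock time is the same in both cases; only what gets attributed to the copy changes.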

score -1 · Answer

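I suggest using cudpp; in my opinion it is faster than Thrust (I am writing a master's thesis on optimization and have tried both libraries). If the copy is very slow, you could try writing your own kernel to copy the data.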
