cuda - Timing a CUDA kernel that needs to run multiple times

Question

I want to measure the time taken by a kernel that is launched multiple times, where each launch processes different data.

cudaEvent_t start;
error = cudaEventCreate(&start);
cudaEvent_t stop;
error = cudaEventCreate(&stop);
float msecTotal = 0.0f;
int nIter = 300;
for (int j = 0; j < nIter; j++)
{
    cudaMemcpy(...);
    // Record the start event
    error = cudaEventRecord(start, NULL);
    matrixMulCUDA1<<< grid, threads >>>(...);
    // Record the stop event
    error = cudaEventRecord(stop, NULL);
    error = cudaEventSynchronize(stop);
    float msec = 0.0f;
    error = cudaEventElapsedTime(&msec, start, stop);
    msecTotal += msec;
}
cout << "Total time = " << msecTotal << endl;

To keep the comparison fair, the algorithm I am comparing against is timed the same way:

cudaEvent_t start;
error = cudaEventCreate(&start);
cudaEvent_t stop;
error = cudaEventCreate(&stop);
float msecTotal = 0.0f;
int nIter = 300;
for (int j = 0; j < nIter; j++)
{
    // Record the start event
    error = cudaEventRecord(start, NULL);
    matrixMulCUDA2<<< grid, threads >>>(...);
    // Record the stop event
    error = cudaEventRecord(stop, NULL);
    error = cudaEventSynchronize(stop);
    float msec = 0.0f;
    error = cudaEventElapsedTime(&msec, start, stop);
    msecTotal += msec;
}
cout << "Total time = " << msecTotal << endl;

My question is: is this method correct? I am not sure, because the time measured this way will obviously be longer than normal.

score 1 · Accepted Answer

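Either way should give you similar results. By recording the events around the kernel launch, you are definitely measuring only the time spent in the kernel and not any time spent in the memcpy.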

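My only nit is that by calling cudaEventSynchronize() in every loop iteration, you are breaking CPU/GPU concurrency, which is actually pretty important for getting good performance.

If you must time each kernel invocation separately (as opposed to recording one pair of events around the whole for loop of nIter iterations), you might want to allocate more CUDA events. If you go that route, you do not need 2 events per loop iteration. Record one event before the loop, then record only one CUDA event per loop iteration, right after the kernel launch. The time for any given kernel invocation can then be computed by calling cudaEventElapsedTime() on adjacent recorded events, after synchronizing once on the last event.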

To record the GPU time of N operations, using N+1 events:

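cudaEvent_t events[N+1];   // N operations need N+1 events
for (int k = 0; k <= N; k++ ) {
    cudaEventCreate( &events[k] );   // events must be created before recording
}

cudaEventRecord( events[0], NULL ); // record first event
for (int j = 0; j < N; j++ ) {
    // invoke kernel, or do something else you want to time
    cudaEventRecord( events[j+1], NULL ); // marks the end of operation j
}
cudaEventSynchronize( events[N] ); // wait once, for the last event only
// to compute the time taken for operation i (0 <= i < N), call:
float ms;
cudaEventElapsedTime( &ms, events[i], events[i+1] ); // start event first, end event second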
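
For reference, here is a minimal complete sketch of the same idea that should compile with nvcc. The kernel scaleKernel, the problem size n, and the launch count nIter are placeholders I made up for illustration, not the asker's matrixMulCUDA1 or its data; most error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel being timed.
__global__ void scaleKernel(float *data, int n, float factor)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= factor;
}

int main()
{
    const int nIter = 8;      // number of launches to time (illustrative value)
    const int n = 1 << 20;    // elements processed per launch (illustrative value)

    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    // nIter launches need nIter + 1 events: one before the first launch
    // and one after each launch.
    cudaEvent_t events[nIter + 1];
    for (int k = 0; k <= nIter; k++) cudaEventCreate(&events[k]);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaEventRecord(events[0], 0);
    for (int j = 0; j < nIter; j++) {
        scaleKernel<<<blocks, threads>>>(d_data, n, 1.0f);
        cudaEventRecord(events[j + 1], 0);  // no host synchronization inside the loop
    }

    // Wait once, for the last event only; all earlier events have completed by then.
    cudaEventSynchronize(events[nIter]);

    float msecTotal = 0.0f;
    for (int j = 0; j < nIter; j++) {
        float msec = 0.0f;
        // Elapsed time between adjacent events = time of launch j.
        cudaEventElapsedTime(&msec, events[j], events[j + 1]);
        printf("launch %d: %.3f ms\n", j, msec);
        msecTotal += msec;
    }
    printf("Total time = %.3f ms\n", msecTotal);

    for (int k = 0; k <= nIter; k++) cudaEventDestroy(events[k]);
    cudaFree(d_data);
    return 0;
}
```

One caveat: if other work is issued between launches, such as the cudaMemcpy in the first loop of the question, the interval between adjacent events includes that work too. In that case you would still need a separate event recorded right before each kernel launch.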
