gpu - OpenACC red-black Gauss-Seidel slower than CPU

Question

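I added OpenACC directives to my red-black Gauss-Seidel solver for the Laplace equation (a simple heated plate problem), but the GPU-accelerated code is no faster than the CPU, even for large problems.

I also wrote a CUDA version, which is much faster than both (for 512x512, on the order of 2 seconds compared to 25 seconds for the CPU and OpenACC).

Can anyone think of a reason for this discrepancy? I realize that CUDA offers the most potential speed, but OpenACC should give something better than the CPU for bigger problems (like the Jacobi solver for the same sort of problem demonstrated here).

The relevant code is below (the full working source is here):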

#pragma acc data copyin(aP[0:size], aW[0:size], aE[0:size], aS[0:size], aN[0:size], b[0:size]) copy(temp_red[0:size_temp], temp_black[0:size_temp])
// red-black Gauss-Seidel with SOR iteration loop
for (iter = 1; iter <= it_max; ++iter) {
  Real norm_L2 = 0.0;

  // update red cells
  #pragma omp parallel for shared(aP, aW, aE, aS, aN, temp_black, temp_red) \
      reduction(+:norm_L2)
  #pragma acc kernels present(aP[0:size], aW[0:size], aE[0:size], aS[0:size], aN[0:size], b[0:size], temp_red[0:size_temp], temp_black[0:size_temp])
  #pragma acc loop independent gang vector(4)
  for (int col = 1; col < NUM + 1; ++col) {
    #pragma acc loop independent gang vector(64)
    for (int row = 1; row < (NUM / 2) + 1; ++row) {

      int ind_red = col * ((NUM / 2) + 2) + row;        // local (red) index
      int ind = 2 * row - (col % 2) - 1 + NUM * (col - 1);  // global index

      #pragma acc cache(aP[ind], b[ind], aW[ind], aE[ind], aS[ind], aN[ind])

      Real res = b[ind] + (aW[ind] * temp_black[row + (col - 1) * ((NUM / 2) + 2)]
                         + aE[ind] * temp_black[row + (col + 1) * ((NUM / 2) + 2)]
                         + aS[ind] * temp_black[row - (col % 2) + col * ((NUM / 2) + 2)]
                         + aN[ind] * temp_black[row + ((col + 1) % 2) + col * ((NUM / 2) + 2)]);

      Real temp_old = temp_red[ind_red];
      temp_red[ind_red] = temp_old * (1.0 - omega) + omega * (res / aP[ind]);

      // calculate residual
      res = temp_red[ind_red] - temp_old;
      norm_L2 += (res * res);

    } // end for row
  } // end for col

  // update black cells
  #pragma omp parallel for shared(aP, aW, aE, aS, aN, temp_black, temp_red) \
          reduction(+:norm_L2)
  #pragma acc kernels present(aP[0:size], aW[0:size], aE[0:size], aS[0:size], aN[0:size], b[0:size], temp_red[0:size_temp], temp_black[0:size_temp])
  #pragma acc loop independent gang vector(4)
  for (int col = 1; col < NUM + 1; ++col) {
    #pragma acc loop independent gang vector(64)
    for (int row = 1; row < (NUM / 2) + 1; ++row) {

      int ind_black = col * ((NUM / 2) + 2) + row;      // local (black) index
      int ind = 2 * row - ((col + 1) % 2) - 1 + NUM * (col - 1);    // global index

      #pragma acc cache(aP[ind], b[ind], aW[ind], aE[ind], aS[ind], aN[ind])

      Real res = b[ind] + (aW[ind] * temp_red[row + (col - 1) * ((NUM / 2) + 2)]
                         + aE[ind] * temp_red[row + (col + 1) * ((NUM / 2) + 2)]
                         + aS[ind] * temp_red[row - ((col + 1) % 2) + col * ((NUM / 2) + 2)]
                         + aN[ind] * temp_red[row + (col % 2) + col * ((NUM / 2) + 2)]);

      Real temp_old = temp_black[ind_black];
      temp_black[ind_black] = temp_old * (1.0 - omega) + omega * (res / aP[ind]);

      // calculate residual
      res = temp_black[ind_black] - temp_old;       
      norm_L2 += (res * res);

    } // end for row
  } // end for col

  // calculate residual
  norm_L2 = sqrt(norm_L2 / ((Real)size));

  if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, norm_L2);

  // if tolerance has been reached, end SOR iterations
  if (norm_L2 < tol) {
    break;
  }
}
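The index arithmetic in the two loops can be checked on the host before any tuning. The following is a minimal sketch, not part of the original source: the helper names are hypothetical, and it assumes the `ind` expressions are copied unchanged from the red and black loops above and that `NUM` is even. It counts how often each of the `NUM * NUM` global cells is visited by one red sweep plus one black sweep.

```c
#include <stdlib.h>

/* Global index of the red cell at (row, col); expression copied from the red loop. */
static int red_global_index(int row, int col, int num) {
    return 2 * row - (col % 2) - 1 + num * (col - 1);
}

/* Global index of the black cell at (row, col); expression copied from the black loop. */
static int black_global_index(int row, int col, int num) {
    return 2 * row - ((col + 1) % 2) - 1 + num * (col - 1);
}

/* Hypothetical helper: returns 1 if one red sweep plus one black sweep
   touches every one of the num*num cells exactly once, and 0 otherwise
   (also 0 if the scratch array cannot be allocated). */
static int coloring_covers_grid_once(int num) {
    int size = num * num;
    int *hits = calloc((size_t)size, sizeof *hits);
    if (hits == NULL) return 0;

    int ok = 1;
    for (int col = 1; col < num + 1; ++col) {
        for (int row = 1; row < (num / 2) + 1; ++row) {
            int r = red_global_index(row, col, num);
            int k = black_global_index(row, col, num);
            if (r < 0 || r >= size || k < 0 || k >= size) {
                ok = 0;            /* index outside the coefficient arrays */
                continue;
            }
            hits[r]++;
            hits[k]++;
        }
    }
    for (int i = 0; i < size; ++i) {
        if (hits[i] != 1) ok = 0;  /* a cell was skipped or visited twice */
    }
    free(hits);
    return ok;
}
```

For an even `NUM` such as 512 the check passes, which suggests the slowdown is not caused by the two colors overlapping or missing cells. For an odd `NUM` the loops skip one cell per column, so the check fails.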

score 3 · Accepted Answer

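OK, I found a semi-solution that reduces the time significantly for smaller problems.

If I insert the lines:

acc_init(acc_device_nvidia);
acc_set_device_num(0, acc_device_nvidia);

before starting my timer, in order to activate and set up the GPU, the time for the 512x512 problem drops to 9.8 seconds, and to 42 seconds for 1024x1024. Increasing the problem size further shows how fast even OpenACC can be compared to running on four CPU cores.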

With this change, the OpenACC code is about 2x slower than the CUDA code, with the gap narrowing to only slightly slower (~1.2x) as the problem size grows.
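A minimal sketch of how the initialization can be kept outside the timed region is shown below. The helper names (`warm_up_device`, `wall_seconds`, `time_solver_seconds`, `placeholder_solver`) are hypothetical and not from the original source. The `_OPENACC` guard is an assumption added so the same file also builds as a plain CPU program, and the wall-clock timer uses C11 `timespec_get`, which may differ from the timer in the original code.

```c
#include <time.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Hypothetical helper: pay the one-time device start-up cost up front.
   Returns 1 when built with OpenACC, 0 for a plain CPU build. */
static int warm_up_device(void) {
#ifdef _OPENACC
    acc_init(acc_device_nvidia);
    acc_set_device_num(0, acc_device_nvidia);
    return 1;
#else
    return 0;
#endif
}

/* Wall-clock time in seconds (C11 timespec_get). */
static double wall_seconds(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical stand-in for the SOR iteration loop from the question. */
static void placeholder_solver(void) {
}

/* Time only the solver: the device is initialized before the clock starts. */
static double time_solver_seconds(void (*solver)(void)) {
    warm_up_device();
    double start = wall_seconds();
    solver();
    return wall_seconds() - start;
}
```

Calling `time_solver_seconds(placeholder_solver)` with the real solver in place of the placeholder reports only the iteration time, so the start-up cost is excluded from the OpenACC timing in the same way as in the measurements above.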

score 0 · Accepted Answer

I downloaded your full code, compiled it, and ran it. The run did not stop, and for the statement

if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, norm_L2);

the result was:

100, nan

200, nan

....

After changing all variables of type Real to float, the result was:

100, 0.000654

200, 0.000370

..., ....

..., ....

8800, 0.000002

8900, 0.000002

9000, 0.000001

9100, 0.000001

9200, 0.000001

9300, 0.000001

9400, 0.000001

9500, 0.000001

9600, 0.000001

9700, 0.000001

CPU

Iterations: 9796

Total time: 5.594017 s

For NUM = 1024, the result was:

Iterations: 27271

Total time: 25.949905 s
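One detail worth noting about the nan output: any comparison with NaN is false, so the test `norm_L2 < tol` can never succeed once the residual becomes NaN, and the loop keeps running until `it_max`. A minimal sketch of a guard is shown below. The enum and function names are hypothetical, and `double` is used here only for illustration in place of `Real`.

```c
#include <math.h>

/* Hypothetical status codes for one convergence check. */
typedef enum { SOR_CONTINUE, SOR_CONVERGED, SOR_DIVERGED } sor_status;

/* Classify the residual norm computed at the end of an iteration.
   A NaN or infinite norm is reported as divergence instead of being
   silently treated as "not yet converged". */
static sor_status check_residual(double norm_L2, double tol) {
    if (!isfinite(norm_L2)) {
        return SOR_DIVERGED;
    }
    return (norm_L2 < tol) ? SOR_CONVERGED : SOR_CONTINUE;
}
```

In the iteration loop, breaking on either `SOR_CONVERGED` or `SOR_DIVERGED` would stop a run like the one above at the first non-finite residual rather than at `it_max`.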
