pragma - OpenACC data movement

Question

I am new to OpenACC and do not fully understand data movement and the "#pragma acc data" clauses.

I have a program written in C. An excerpt from the code looks like this:

#pragma acc data create(intersectionSet[0:intersectionsCount][0:4]) // line 122
#pragma acc kernels // line 123
for (int i = 0; i<intersectionsCount; i++){ // line 124
    intersectionSet[i][0] = 9; // line 125
}

The value of intersectionsCount is 210395. I compiled the code above as follows and then ran it:

pgcc -o rect_openacc -fast -Minfo -acc -ta=nvidia,time rect.c

I got this output:

    time(us): 1,475,607
122: data region reached 1 time
    31: kernel launched 210395 times
        grid: [1]  block: [128]
         device time(us): total=1,475,315 max=15 min=7 avg=7
        elapsed time(us): total=5,451,647 max=24,028 min=24 avg=25
123: compute region reached 1 time
    124: kernel launched 1 time
        grid: [1644]  block: [128]
         device time(us): total=292 max=292 min=292 avg=292
        elapsed time(us): total=312 max=312 min=312 avg=312
156: data region reached 1 time

After reading the output, I have a few questions:

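1. There is no acc pragma at line 31, so I don't understand why the output refers to line 31. Does it mean this is something I cannot trace?
2. The line "31: kernel launched 210395 times" says that a kernel was launched 210395 times. This part took 5,451,647 us of elapsed time, which seems rather long to me, and I don't know whether it is normal for a kernel to be launched that many times. The for loop is simple and I would not expect it to take much time. Am I using the pragmas incorrectly?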

Update

The program has several header files, but none of them contains an "acc data" or "acc kernels" pragma.

When I compile the code with "-Minfo=all", the result is as follows:

breakStringToCharArray:
 11, include "stringHelper.h"
      50, Loop not vectorized/parallelized: contains call
countChar:
 11, include "stringHelper.h"
      74, Loop not vectorized/parallelized: not countable
extractCharToIntRequiredInt:
 11, include "stringHelper.h"
      93, Loop not vectorized/parallelized: contains call
extractArray:
 12, include "fileHelper.h"
      49, Loop not vectorized/parallelized: contains call
isRectOverlap:
 13, include "shapeHelper.h"
      23, Generating acc routine vector
          Generating Tesla code
getRectIntersection:
 13, include "shapeHelper.h"
      45, Generating acc routine vector
          Generating Tesla code
getRectIntersectionInGPU:
 13, include "shapeHelper.h"
      69, Generating acc routine vector
          Generating Tesla code
max:
 13, include "shapeHelper.h"
      98, Generating acc routine vector
          Generating Tesla code
min:
 13, include "shapeHelper.h"
     118, Generating acc routine vector
          Generating Tesla code
main:
64, Loop not vectorized/parallelized: contains call
108, Loop not vectorized/parallelized: contains call
122, Generating create(intersectionSet[:intersectionsCount][:4])
124, Loop is parallelizable
     Accelerator kernel generated
     Generating Tesla code
124, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

I create intersectionSet this way:

intersectionSet = (int **)malloc(sizeof(int **) * intersectionsCount);
for (i = 0; i<intersectionsCount; i++){
    intersectionSet[i] = (int *)malloc(sizeof(int *) * 4);
}

score 3 · Accepted Answer

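What's happening is that, since you have a pointer to an array of pointers, "**" (at least I'm assuming that's what intersectionSet is), the compiler must first allocate the array of pointers on the device, then loop through each element to allocate the individual device arrays. Finally, for each element it needs to launch a kernel to set the pointer value on the device. That is where the 210395 kernel launches in your profile come from: one per row of intersectionSet. Here's some pseudo-code to help illustrate: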

devPtrPtr = deviceMalloc(numElements * pointerSize);    // device array of row pointers
for (i=0; i < numElements; ++i) {
   devPtr = deviceMalloc(elementSize * dataTypeSize);   // device storage for row i
   call deviceKernelToSetPointer<<<1,128>>>(devPtrPtr[i],devPtr);   // one launch per row
}
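To make the cost concrete, here is a small host-only model of the pseudo-code above. It does not touch the GPU or any real runtime API; `setup_cost` and `model_setup_cost` are hypothetical names invented for this sketch, and the only thing it does is count the steps the pseudo-code performs for a given number of row pointers.

```c
#include <stddef.h>

/* Tally of the steps in the pseudo-code for one data region.
   Hypothetical type, used only for illustration. */
struct setup_cost {
    size_t device_allocations;      /* deviceMalloc calls */
    size_t pointer_kernel_launches; /* one small kernel per row pointer */
};

/* Count the steps needed for an array of `rows` row pointers. */
struct setup_cost model_setup_cost(size_t rows)
{
    /* One allocation for the table of row pointers itself. */
    struct setup_cost cost = {1, 0};

    for (size_t i = 0; i < rows; i++) {
        cost.device_allocations++;      /* device array for row i */
        cost.pointer_kernel_launches++; /* launch that stores its address */
    }
    return cost;
}
```

For `rows = 210395` the model gives 210395 pointer-setting launches, which matches the "kernel launched 210395 times" entry in the profile. The number of launches depends only on how many row pointers there are, not on how long each row is.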

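To help your code, I'd switch the dimensions so that the column length is 4 and the row length is "intersectionsCount". There would then be only 4 row pointers to set up instead of 210395. This will also help data access on the device, since the "vector" loop should correspond to the stride-1 (contiguous) dimension in order to avoid memory divergence.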
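A minimal sketch of the switched layout might look like the following. The helper names (`alloc_intersection_set`, `free_intersection_set`, `fill_first_field`) and the `NUM_FIELDS` macro are assumptions made for this example, not part of the original program. The sketch also uses a `copy` clause where the question used `create`, purely so that the host can read the result back. When built without OpenACC support, the pragmas are ignored and the loop simply runs on the host.

```c
#include <assert.h>
#include <stdlib.h>

#define NUM_FIELDS 4  /* the short dimension from the question */

/* Allocate NUM_FIELDS rows, each holding `count` contiguous, zeroed ints.
   Returns NULL if any allocation fails. */
int **alloc_intersection_set(int count)
{
    assert(count > 0);
    int **set = malloc(NUM_FIELDS * sizeof *set);
    if (set == NULL) return NULL;

    for (int f = 0; f < NUM_FIELDS; f++) {
        set[f] = calloc((size_t)count, sizeof **set);
        if (set[f] == NULL) {
            while (f-- > 0) free(set[f]);  /* undo earlier rows */
            free(set);
            return NULL;
        }
    }
    return set;
}

void free_intersection_set(int **set)
{
    if (set == NULL) return;
    for (int f = 0; f < NUM_FIELDS; f++) free(set[f]);
    free(set);
}

/* Same work as the loop in the question, with the indices swapped:
   the parallel loop index i now walks the contiguous dimension. */
void fill_first_field(int **set, int count)
{
    #pragma acc data copy(set[0:NUM_FIELDS][0:count])
    {
        #pragma acc kernels
        for (int i = 0; i < count; i++) {
            set[0][i] = 9;
        }
    }
}
```

With this layout there are only `NUM_FIELDS` row pointers, and consecutive values of `i` touch adjacent memory in `set[0]`.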

Hope this helps,

Mat
