cuda - Kernel using AoS is faster than the one using SoA

Question

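I have two versions of a kernel that perform the same task: filling a linked cell list. The only difference between them is the data type used to store the particle positions. The first uses a float array (4 floats per particle, to allow 128-bit reads/writes), and the second uses an array of vec3f structs (a struct holding 3 floats).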

Running some tests with nvprof, I found that the second kernel (the one using vec3f) runs faster than the first one:

 Time(%)      Time   Calls       Avg       Min       Max  Name
   42.88    37.26s       2    18.63s   23.97us    37.26s  adentu_grid_cuda_filling_kernel(int*, int*, int*, float*, int, _vec3f, _vec3f, _vec3i)
   11.00     3.93s       2     1.97s   25.00us     3.93s  adentu_grid_cuda_filling_kernel(int*, int*, int*, _vec3f*, int, _vec3f, _vec3f, _vec3i)

The tests fill a linked cell list using 256 and 512000 particles.

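My question is: what happened here? I assumed the float array would give better memory access because the accesses are coalesced, compared with the vec3f struct array, whose memory is unaligned. Did I misunderstand something?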

These are the kernels. The first kernel:

__global__ void adentu_grid_cuda_filling_kernel (int *head,
                                                 int *linked,
                                                 int *cellnAtoms,
                                                 float *pos, 
                                                 int nAtoms, 
                                                 vec3f origin, 
                                                 vec3f h,
                                                 vec3i nCell)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= nAtoms)
        return;

    vec3i cell;
    vec3f _pos = (vec3f){(float)pos[idx*4+0], (float)pos[idx*4+1], (float)pos[idx*4+2]};

    cell.x =  floor ((_pos.x - origin.x)/h.x);
    cell.y =  floor ((_pos.y - origin.y)/h.y);
    cell.z =  floor ((_pos.z - origin.z)/h.z);

    int c = nCell.x * nCell.y * cell.z + nCell.x * cell.y + cell.x;

    int i;
    if (atomicCAS (&head[c], -1, idx) != -1){
        i = head[c];
        while (atomicCAS (&linked[i], -1, idx) != -1)
                i = linked[i];
    }
    atomicAdd (&cellnAtoms[c], 1);
}

This is the second kernel:

__global__ void adentu_grid_cuda_filling_kernel (int *head,
                                                 int *linked,
                                                 int *cellNAtoms,
                                                 vec3f *pos,
                                                 int nAtoms,
                                                 vec3f origin,
                                                 vec3f h,
                                                 vec3i nCell)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= nAtoms)
        return;

    vec3i cell;
    vec3f _pos = pos[idx];

    cell.x = floor ((_pos.x - origin.x)/h.x);
    cell.y = floor ((_pos.y - origin.y)/h.y);
    cell.z = floor ((_pos.z - origin.z)/h.z);

    int c = nCell.x * nCell.y * cell.z + nCell.x * cell.y + cell.x;

    int i;
    if (atomicCAS (&head[c], -1, idx) != -1){
        i = head[c];
        while (atomicCAS (&linked[i], -1, idx) != -1)
                i = linked[i];
    }
    atomicAdd (&cellNAtoms[c], 1);
}

This is the vec3f struct:

typedef struct _vec3f {float x, y, z;} vec3f;

score 5 · Accepted Answer

This is not an example of AoS versus SoA. Let's look at the important lines of code and the data structures implicit in them.

Your first, "SoA" or "slow" case:

vec3f _pos = (vec3f){(float)pos[idx*4+0], (float)pos[idx*4+1], (float)pos[idx*4+2]};
                                      ^                    ^                    ^
                                      |                    |                    |
                               These values are stored in *adjacent* memory locations

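So an individual thread accesses pos[idx*4] and then the 2 locations immediately after it. This is exactly how a structure is stored. What you are calling a structure of arrays is, as stored in memory, actually an array of structures, and adjacent threads read locations that are 4 floats apart. To have a valid "SoA" case, your code would need to look like this: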

vec3f _pos = (vec3f){(float)pos1[idx], (float)pos2[idx], (float)pos3[idx]};
                                 ^
                                 |
               Adjacent threads will read adjacent values for pos1, pos2, and pos3
                    leading to *coalesced* access.

Your "AoS" or "fast" case doesn't really have a different storage format: it is also an array of structures.
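To make the layouts concrete, here is a minimal sketch in plain host-side C of the index arithmetic behind each one. The helper names (aos4_index, aos3_index, soa_index) are hypothetical and not part of the original code; the SoA variant assumes all three coordinate arrays live in one buffer of 3 * n floats, which is only one possible arrangement.

```c
#include <stddef.h>

/* Hypothetical helpers for illustration only. Each returns the float index
   of component k (0 = x, 1 = y, 2 = z) for particle idx. */

/* Padded AoS, as in the first kernel: x, y, z and one unused float per particle. */
size_t aos4_index(size_t idx, int k) { return idx * 4 + (size_t)k; }

/* Packed AoS, equivalent to indexing an array of vec3f as in the second kernel. */
size_t aos3_index(size_t idx, int k) { return idx * 3 + (size_t)k; }

/* SoA (assumed single buffer): all x values, then all y values, then all z values. */
size_t soa_index(size_t idx, int k, size_t n) { return (size_t)k * n + idx; }
```

For adjacent threads 0 and 1, the x component sits at float indices 0 and 4 in the padded layout, 0 and 3 in the vec3f layout, and 0 and 1 in the SoA layout. Only in the last case do neighbouring threads read neighbouring floats.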

score 1 · Accepted Answer

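In my opinion, both of your approaches are actually AoS. The only difference is that the first approach is AoS with a structure of four elements, while the second approach uses only three elements. That is why the second solution is preferable.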

If you really want to use SoA in the first solution, you would have to organize the pos array as follows (with N being the number of particles):

vec3f _pos = (vec3f){(float)pos[idx], (float)pos[N + idx], (float)pos[2 * N + idx]};
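As a rough sketch of how the existing data could be rearranged into that layout on the host before copying it to the device, something like the following could be used. The function name aos4_to_soa is hypothetical, and it assumes the source buffer holds 4 floats per particle with the fourth one unused, as in the first kernel.

```c
#include <stddef.h>

/* Hypothetical helper, not from the original post: repack a padded AoS buffer
   (x, y, z, unused per particle) into one SoA buffer laid out as
   [x0 .. x(n-1) | y0 .. y(n-1) | z0 .. z(n-1)].
   aos must hold 4 * n floats and soa must hold 3 * n floats. */
void aos4_to_soa(const float *aos, float *soa, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        soa[i]         = aos[i * 4 + 0];  /* x block */
        soa[n + i]     = aos[i * 4 + 1];  /* y block */
        soa[2 * n + i] = aos[i * 4 + 2];  /* z block */
    }
}
```

The kernel would then read pos[idx], pos[N + idx] and pos[2 * N + idx] as shown above, with N equal to the n passed here.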
