c++ - How to structure data for optimal speed in a CUDA app

Question

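I am trying to write a simple particle system that uses CUDA to update the particle positions. Right now I define a particle as an object with a position made of three float values and a velocity also made of three float values. When updating the particles, I add a constant value to the Y component of the velocity to simulate gravity, then add the velocity to the current position to get the new position. In terms of memory management, is it better to maintain two separate arrays of floats to store the data, or to structure it in an object-oriented way? Something like this: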

struct Vector
{
    float x, y, z;
};

struct Particle
{
    Vector position;
    Vector velocity;
};

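The size of the data seems to be the same with either method (4 bytes per float, 3 floats per Vector, 2 Vectors per Particle, 24 bytes in total). The object-oriented approach seems to allow more efficient data transfer to the GPU, because I could use one memory copy statement instead of two (and, in the long run, more, since several other pieces of particle information will become relevant, such as age, lifetime, weight/mass, and temperature). There is also the simple readability of the code and the ease of working with it, which inclines me toward the object-oriented approach. But the examples I have seen do not use structured data, so I wonder whether there is a reason.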

So the question is: which is better, individual arrays of data or structured objects?

score 18 · Accepted Answer

It's common in data parallel programming to talk about "Struct of Arrays" (SOA) versus "Array of Structs" (AOS). An array of your Particle structs is AOS; keeping separate arrays of floats is SOA. Many parallel programming paradigms, in particular SIMD-style paradigms, will prefer SOA.
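To make the two layouts concrete, here is a minimal host-side C++ sketch of both, together with the update step described in the question. It is not CUDA code: the loop index i stands in for the thread index, and ParticlesSOA, update_aos, update_soa, gravity and dt are names I made up for illustration.

```cpp
#include <cstddef>
#include <vector>

// AOS: each array element carries every field of one particle.
struct Vector   { float x, y, z; };
struct Particle { Vector position; Vector velocity; };

// SOA: one contiguous array per component.
// ParticlesSOA and its member names are hypothetical, chosen for this sketch.
struct ParticlesSOA {
    std::vector<float> px, py, pz;   // positions
    std::vector<float> vx, vy, vz;   // velocities
    explicit ParticlesSOA(std::size_t n)
        : px(n), py(n), pz(n), vx(n), vy(n), vz(n) {}
};

// AOS update: neighbouring values of i touch addresses
// sizeof(Particle) bytes apart.
void update_aos(std::vector<Particle>& p, float gravity, float dt) {
    for (std::size_t i = 0; i < p.size(); ++i) {   // i plays the role of the thread index
        p[i].velocity.y += gravity * dt;           // gravity acts on Y only
        p[i].position.x += p[i].velocity.x * dt;
        p[i].position.y += p[i].velocity.y * dt;
        p[i].position.z += p[i].velocity.z * dt;
    }
}

// SOA update: neighbouring values of i touch consecutive floats
// within each component array.
void update_soa(ParticlesSOA& p, float gravity, float dt) {
    for (std::size_t i = 0; i < p.px.size(); ++i) {
        p.vy[i] += gravity * dt;
        p.px[i] += p.vx[i] * dt;
        p.py[i] += p.vy[i] * dt;
        p.pz[i] += p.vz[i] * dt;
    }
}
```

Both functions compute the same result; only the distance in memory between the values that neighbouring indices touch is different, and that distance is what matters on the GPU.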

In GPU programming, the reason that SOA is typically preferred is to optimise the accesses to the global memory. You can view the recorded presentation on Advanced CUDA C from GTC last year for a detailed description of how the GPU accesses memory.

The main point is that memory transactions have a minimum size of 32 bytes and you want to maximise the efficiency of each transaction.

With AOS:

position[base + tid].x = position[base + tid].x + velocity[base + tid].x * dt;
//  ^ write to every third address                    ^ read from every third address
//                           ^ read from every third address

With SOA:

position.x[base + tid] = position.x[base + tid] + velocity.x[base + tid] * dt;
//  ^ write to consecutive addresses                  ^ read from consecutive addresses
//                           ^ read from consecutive addresses

In the second case, reading from consecutive addresses means that you have 100% efficiency versus 33% in the first case. Note that on older GPUs (compute capability 1.0 and 1.1) the situation is much worse (13% efficiency).
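The 100% and 33% figures can be reproduced with a simplified model: count how many aligned 32-byte segments a group of threads touches and divide the bytes actually used by the bytes fetched. The sketch below is plain C++ and ignores caches and other hardware details; transaction_efficiency and its parameters are hypothetical names, and a warp of 32 threads reading 4-byte floats is my assumption.

```cpp
#include <cstddef>
#include <set>

// Fraction of fetched bytes that are actually used when `threads` threads
// each read one value of `elem_bytes` bytes, spaced `stride_bytes` apart,
// and memory is fetched in aligned segments of `segment_bytes` bytes.
// Simplified model; assumes threads > 0 and elem_bytes > 0.
double transaction_efficiency(std::size_t threads,
                              std::size_t elem_bytes,
                              std::size_t stride_bytes,
                              std::size_t segment_bytes = 32) {
    std::set<std::size_t> segments;  // indices of the segments touched
    for (std::size_t t = 0; t < threads; ++t) {
        const std::size_t first = t * stride_bytes;        // first byte read by thread t
        const std::size_t last  = first + elem_bytes - 1;  // last byte read by thread t
        for (std::size_t s = first / segment_bytes; s <= last / segment_bytes; ++s) {
            segments.insert(s);
        }
    }
    const double useful  = static_cast<double>(threads * elem_bytes);
    const double fetched = static_cast<double>(segments.size() * segment_bytes);
    return useful / fetched;
}
```

With 32 threads and 4-byte floats, a stride of 4 bytes (SOA) touches 4 segments, so 128 of 128 fetched bytes are used. A stride of 12 bytes (three floats per vector) touches 12 segments, so only 128 of 384 bytes are used, which is the 33% above. My reading of the 13% figure for compute capability 1.0 and 1.1 is that each thread gets its own 32-byte transaction, giving 4 / 32 = 12.5%; treat that as an assumption rather than something this model proves.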

There is one other possibility - if you had two or four floats in the struct then you could read the AOS with 100% efficiency:

float4 lpos;
float4 lvel;
lpos = position[base + tid];
lvel = velocity[base + tid];
lpos.x += lvel.x * dt;
//...
position[base + tid] = lpos;

Again, check out the Advanced CUDA C presentation for the details.
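As a host-side illustration of the padded variant, here is a minimal C++ sketch. Float4 is a hypothetical stand-in for CUDA's built-in float4 (four floats, 16-byte aligned), w is treated as unused padding, and update_padded is a made-up name; the loop index again plays the role of base + tid.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for CUDA's built-in float4.
struct alignas(16) Float4 { float x, y, z, w; };
static_assert(sizeof(Float4) == 16, "one element should be exactly 16 bytes");

// Each iteration reads and writes a whole 16-byte element, so neighbouring
// indices touch back-to-back memory with no gaps between the useful bytes.
void update_padded(std::vector<Float4>& position,
                   const std::vector<Float4>& velocity,
                   float dt) {
    for (std::size_t i = 0; i < position.size(); ++i) {
        Float4 lpos = position[i];        // whole-struct read
        const Float4 lvel = velocity[i];  // whole-struct read
        lpos.x += lvel.x * dt;
        lpos.y += lvel.y * dt;
        lpos.z += lvel.z * dt;            // w is padding and left untouched
        position[i] = lpos;               // whole-struct write
    }
}
```

The cost of this layout is one extra float per vector; whether that trade is worthwhile depends on whether you have something useful to put in w.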
