c - Conway のライフゲームを CUDA 用に最適化するには?

Question

コンウェイのライフゲーム用に、次の CUDA カーネルを作成しました。

__global__ void gameOfLife(float* returnBuffer, int width, int height) {  
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;  
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;  
    float p = tex2D(inputTex, x, y);  
    float neighbors = 0;  
    neighbors += tex2D(inputTex, x+1, y);  
    neighbors += tex2D(inputTex, x-1, y);  
    neighbors += tex2D(inputTex, x, y+1);  
    neighbors += tex2D(inputTex, x, y-1);  
    neighbors += tex2D(inputTex, x+1, y+1);  
    neighbors += tex2D(inputTex, x-1, y-1);  
    neighbors += tex2D(inputTex, x-1, y+1);  
    neighbors += tex2D(inputTex, x+1, y-1);  
    __syncthreads();  
    float final = 0;  
    if(neighbors < 2) final = 0;  
    else if(neighbors > 3) final = 0;  
    else if(p != 0) final = 1;  
    else if(neighbors == 3) final = 1;  
    __syncthreads();  
    returnBuffer[x + y*width] = final;  
}

エラー/最適化を探しています。並列プログラミングは私にとってまったく新しいものであり、それを正しく行う方法がわからない。

残りは、入力配列から CUDA 配列にバインドされた 2D テクスチャ inputTex への memcpy です。出力は、グローバルメモリからホストに memcpy されてから処理されます。

ご覧のとおり、スレッドは単一のピクセルを処理します。一部の情報源は、スレッドごとに行以上を実行することを提案しているため、それが最速の方法であるかどうかはわかりません。私が正しく理解していれば、NVidia自身は、スレッドが多いほど良いと言っています。これについては、実務経験のある方からのアドバイスをいただければ幸いです。

score 11 · Accepted Answer

私の2セント。

全体は、マルチプロセッサと GPU メモリ間の通信の遅延によって制限されるようです。単独で実行するのに 30 ～ 50 クロックティックかかるコードがあり、必要なデータがキャッシュにない場合、それぞれ 200 クロックティック以上かかるメモリアクセスが少なくとも 3 回生成されます。

テクスチャメモリを使用することは、これに対処する良い方法ですが、必ずしも最適な方法とは限りません。

少なくとも、スレッドごとに一度に (水平方向に) 4 ピクセルを実行するようにしてください。グローバルメモリには、一度に 128 バイトにアクセスできます (ワープが 128 バイト間隔で任意のバイトにアクセスしようとしている限り、ほとんど追加コストなしでキャッシュライン全体を取得することもできます)。ワープは 32 スレッドなので、各スレッドを 4 ピクセルで処理すると効率的です。

さらに、垂直方向に隣接するピクセルを同じマルチプロセッサで処理する必要があります。これは、隣接する行が同じ入力データを共有しているためです。ピクセル (x=0,y=0) が 1 つの MP によって処理され、ピクセル (x=0,y=1) が別の MP によって処理される場合、両方の MP がそれぞれ 3 つのグローバルメモリ要求を発行する必要があります。両方が同じ MP によって処理され、結果が (暗黙的または明示的に) 適切にキャッシュされている場合、必要なのは合計 4 つだけです。これは、各スレッドを複数の垂直ピクセルで動作させるか、blockDim.y>1 にすることで実行できます。

より一般的には、各 32 スレッドワープに、MP で利用可能なメモリ (16 ～ 48 kb、または少なくとも 128x128 ブロック) と同じ量のメモリをロードしてから、そのウィンドウ内のすべてのピクセルを処理することをお勧めします。

2.0 より前のコンピューティング互換性のあるデバイスでは、共有メモリを使用する必要があります。コンピューティング互換性 2.0 および 2.1 のデバイスでは、キャッシュ機能が大幅に改善されているため、グローバルメモリで問題ない場合があります。

スレッドごとに 4 ピクセル、ワープごとに 32 スレッドで動作する単純な実装で発生するように、各ワープが入力ピクセルの各水平行で 3 つではなく 2 つのキャッシュラインのみにアクセスすることを確認することで、いくつかの重要な節約が得られる可能性があります。

float をバッファタイプとして使用する正当な理由はありません。メモリ帯域幅が 4 倍になるだけでなく、コードの信頼性が低下し、バグが発生しやすくなります。(たとえばif(neighbors == 3)、float と integer を比較しているので、それが正しく機能することを確信していますか?) unsigned char を使用します。さらに良いことに、 uint8_t を使用し、それが定義されていない場合は unsigned char を意味するように typedef します。

最後に、実験の価値を過小評価しないでください。多くの場合、CUDA コードのパフォーマンスはロジックでは簡単に説明できず、パラメーターを微調整して何が起こるかを確認する必要があります。

score 4 · Accepted Answer

TL;DR: see: http://golly.sourceforge.net

The problem is that most CUDA implementations follow the brain dead idea of manual counting of the neighbors. This is so dead slow that any smart serial CPU implementation will outperform it.

The only sensible way to do GoL calculations is using a lookup table.
The currently fastest implementations on a CPU use lookup a square 4x4 = 16 bit block to see get the future 2x2 cells inside.

in this setup the cells are laid out like so:

 01234567
0xxxxxxxx //byte0
1xxxxxxxx //byte1 
2  etc
3
4
5
6
7

Some bit-shifting is employed to get a 4x4 block to fit into a word and that word is looked up using a lookup table. The lookup tables holds words as well, this way 4 different versions of the outcome can be stored in the lookup table, so you can minimize the amount of bitshifting needed to be done on the input and/or the output.

In addition the different generations are staggered, so that you only have to look at 4 neighboring slabs, instead of 9. Like so:

AAAAAAAA 
AAAAAAAA   BBBBBBBB
AAAAAAAA   BBBBBBBB
AAAAAAAA   BBBBBBBB
AAAAAAAA   BBBBBBBB
AAAAAAAA   BBBBBBBB
AAAAAAAA   BBBBBBBB
AAAAAAAA   BBBBBBBB
           BBBBBBBB
//odd generations (A) are 1 pixel above and to the right of B,
//even generations (B) are 1 pixels below and to the left of A.

This alone results in a 1000x+ speed-up compared to silly counting implementations.

Then there is the optimization of not calculating slabs that are static or have a periodicity of 2.

And then there is HashLife, but that's a different beast altogether.
HashLife can generate Life patterns in O(log n) time, instead of the O(n) time normal implementations can. This allows you to calculate generation: 6,366,548,773,467,669,985,195,496,000 (6 octillion) in mere seconds.
Unfortunately Hashlife requires recursion, and thus is difficult on CUDA.

score 3 · Accepted Answer

このスレッドを見てください。私たちはそこで多くの改善を行いました...

http://forums.nvidia.com/index.php?showtopic=152757&st=60

c - Conway のライフ ゲームを CUDA 用に最適化するには?

3 に答える 3

Related

Reference

c - Conway のライフゲームを CUDA 用に最適化するには?