c# - Using local workers in OpenCL for a large matrix calculation

Question

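I am new to using OpenCL (with the OpenCL.NET library) in Visual Studio C#, and I am currently working on an application that computes a large 3D matrix. At each pixel in the matrix, 192 unique values are computed and then summed to give the final value for that pixel. Functionally, then, it is like a 4-D matrix of size (161 x 161 x 161) x 192.

Right now I call my kernel from the host code like this: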

//C# host code
...
float[] BigMatrix = new float[161*161*161]; //1-D result array
CLCalc.Program.Variable dev_BigMatrix = new CLCalc.Program.Variable(BigMatrix);
CLCalc.Program.Variable dev_OtherArray = new CLCalc.Program.Variable(otherArray);
//...load some other variables here too.
CLCalc.Program.Variable[] args = new CLCalc.Program.Variable[7] { /* stuff... */ };

//Here, I execute the kernel, with a 2-dimensional worker pool:
BigMatrixCalc.Execute(args, new int[2]{N*N*N,192});
dev_BigMatrix.ReadFromDeviceTo(BigMatrix);

Sample kernel code is posted below.

__kernel void MyKernel(
__global float * BigMatrix,
__global float * otherArray
//various other variables...
)
{
    int N = 161; //Size of matrix edges
    int pixel_id = get_global_id(0); //The location of the pixel in the 1D array
    int array_id = get_global_id(1); //The location within the otherArray


    //Finding the x,y,z values of the pixel_id.
    float3 p;
    p.x = pixel_id % N;    
    p.y = ((pixel_id % (N*N))-p.x)/N;
    p.z = (pixel_id - p.x - p.y*N)/(N*N);

    float result;

    //...
    //Some long calculation for 'result' involving otherArray and p...
    //...

    BigMatrix[pixel_id] += result;
}

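My code currently works, but I am looking for speed in this application, and I am not sure whether my worker/group setup is the best approach (i.e. worker pool dimensions of 161*161*161 and 192).

I have seen other examples that organize the global worker pool into local work groups to increase efficiency, but I am not quite sure how to implement that in OpenCL.NET. I am also not sure how this differs from just creating another dimension in the worker pool.

So my question is: can I use local groups here? In general, how does using local groups differ from just calling an n-dimensional worker pool? (i.e. calling Execute(args, new int[]{(N*N*N),192}), versus having a local work group size of 192?)

Thanks for all the help!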

score 1 · Accepted Answer

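I have a few suggestions for you:

I think your code has a race condition. The last line of your kernel has the same element of BigMatrix modified by multiple different work items.
If the matrix really is 161x161x161, there are plenty of work items to use those dimensions as your only dimensions. You already have more than 4 million work items, which should be more than enough parallelism for your machine. You don't need 192 times that. Moreover, if you don't split the computation of an individual pixel across multiple work items, you won't need to synchronize the final addition.
If your global work size is not a nice multiple of a large power of 2, you might try padding it out so that it becomes one. Even if you pass NULL as your local work size, some OpenCL implementations choose inefficient local sizes for global sizes that don't divide nicely.
If you don't need local memory or barriers for your algorithm, you can pretty much skip local work groups.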
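To make the padding suggestion concrete, here is a minimal host-side sketch in plain C. The helper names and the multiple of 256 are my own assumptions for illustration, not something from the question or from OpenCL.NET. Note that once the global size is padded, the kernel would also need a guard such as `if (pixel_id >= N*N*N) return;` so the extra work items do not write past the end of BigMatrix.

```c
#include <stddef.h>

/* Round n up to the next multiple of m (m must be > 0). */
size_t round_up_to_multiple(size_t n, size_t m)
{
    return ((n + m - 1) / m) * m;
}

/* Hypothetical helper: padded 1-D global size for an edge*edge*edge matrix. */
size_t padded_global_size(size_t edge, size_t multiple)
{
    return round_up_to_multiple(edge * edge * edge, multiple);
}
```

For the 161-edge matrix and a multiple of 256, this pads 4173281 work items up to 4173312, so only 31 work items are wasted.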

Hope this helps!

score 1 · Accepted Answer

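I think a lot of performance is being lost waiting on memory access. I have answered a similar SO question, and I hope my post there helps you. Please ask any questions you have.

Optimizations:

The big boost in my version of your kernel comes from reading otherArray into local memory.
Each work item computes 4 values in BigMatrix. This means they can be written at the same time, to the same cache line. The loss of parallelism is minimal because there are still over 1M work items to execute.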
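One way a work group can share the job of copying otherArray is a strided loop, where work item `local_id` copies elements `local_id`, `local_id + local_size`, and so on. The following is only a serial C emulation of that access pattern (the function name and the `copies_per_element` counter are made up for illustration). It shows that every element is copied exactly once, whatever the group size.

```c
#include <string.h>

#define OTHER_SIZE 192

/* Serial emulation of one work group filling a shared copy of otherArray.
   Each simulated work item copies indices local_id, local_id + local_size, ... */
void group_cooperative_copy(const float *other_array, float *local_copy,
                            int *copies_per_element, int local_size)
{
    memset(copies_per_element, 0, OTHER_SIZE * sizeof(int));
    for (int local_id = 0; local_id < local_size; local_id++) {
        for (int i = local_id; i < OTHER_SIZE; i += local_size) {
            local_copy[i] = other_array[i];
            copies_per_element[i]++;   /* count how often element i was written */
        }
    }
}
```

On a real device the work items run concurrently, so the group has to synchronize after the copy before anyone reads the shared values.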

...

#define N 161
#define Nsqr (N*N)
#define Ncub (N*N*N)
#define otherSize 192

__kernel void MyKernel(__global float * BigMatrix, __global float * otherArray)
{
    //using 1 quarter of the total size of the matrix
    //this work item will be responsible for computing 4 consecutive values in BigMatrix
    //also reduces global size to (N^3)/4  ~= 1043000 for N=161

    int global_id = get_global_id(0) * 4; //The location of the first pixel in the 1D array
    int pixel_id;
    //array_id won't be used anymore. work items will process BigMatrix[pixel_id] entirely

    int local_id = get_local_id(0); //work item id within the group
    int local_size = get_local_size(0); //size of group


    float result[4]; //result cached for 4 global values
    int i, j;
    float3 p;

    //cache the values in otherArray to local memory
    //now each work item in the group will be able to read the values efficiently
    //each element in otherArray will be read a total of N^3 times, so this is important
    //opencl specifies at least 16kb of local memory, so up to 4k floats will work fine
    __local float otherValues[otherSize];
    for(i=local_id; i<otherSize; i+= local_size){
        otherValues[i] = otherArray[i];
    }
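    //barrier (not just mem_fence): every work item in the group must finish
    //its part of the copy before any work item reads otherValues
    barrier(CLK_LOCAL_MEM_FENCE);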

    //now this work item can compute the complete result for pixel_id 
    for(j=0;j<4;j++){
        result[j] = 0;
        pixel_id = global_id + j;

        //Finding the x,y,z values of the pixel_id.
        //TODO: optimize the calculation of p.y and p.z
        //they will be the same most of the time for a given work item
        p.x = pixel_id % N;    
        p.y = ((pixel_id % Nsqr)-p.x)/N;
        p.z = (pixel_id - p.x - p.y*N)/Nsqr;

        for(i=0;i<otherSize;i++){
            //...
            //Some long calculation for 'result' involving otherValues[i] and p...
            //...
            //result[j] += ...
        }
    }
    //4 consecutive writes to BigMatrix will fall in the same cacheline (faster)
    BigMatrix[global_id] += result[0];
    BigMatrix[global_id + 1] += result[1];
    BigMatrix[global_id + 2] += result[2];
    BigMatrix[global_id + 3] += result[3];
}
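One detail worth checking in a kernel like this: macros such as Nsqr are substituted as plain text, so whether the macro body is parenthesized changes what `pixel_id % Nsqr` and `.../Nsqr` mean. A small plain C sketch of the difference (the `_UNSAFE`/`_SAFE` names are only for illustration):

```c
#define N 161
#define NSQR_UNSAFE N*N     /* expands textually, with no grouping */
#define NSQR_SAFE   (N*N)   /* always behaves as a single value */

/* pixel_id % N*N parses as (pixel_id % N) * N */
int mod_unsafe(int pixel_id) { return pixel_id % NSQR_UNSAFE; }
int mod_safe(int pixel_id)   { return pixel_id % NSQR_SAFE; }

/* v / N*N parses as (v / N) * N */
int div_unsafe(int v) { return v / NSQR_UNSAFE; }
int div_safe(int v)   { return v / NSQR_SAFE; }
```

For example, with the unparenthesized form `200 % N*N` evaluates as `(200 % 161) * 161`, not as `200 % 25921`.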

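Notes:

The global work size needs to be a multiple of four, and ideally a multiple of 4*workgroupsize. This is because there is no error checking to see whether each pixel_id falls within the range 0..N^3-1. Unprocessed elements could be crunched by the CPU while you wait for the kernel to execute.
The work group size should be fairly large. This forces the cached values to be used more heavily, and the benefit of caching the data in LDS grows.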
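As a rough sketch of the split described in the first note, the host could launch only whole work groups and keep the leftover elements for the CPU. The struct and function names below are hypothetical, and the work group size of 256 in the example is an assumption.

```c
#include <stddef.h>

typedef struct {
    size_t global_size;  /* number of work items to launch */
    size_t first_cpu;    /* first element index left for the host */
    size_t cpu_count;    /* number of elements the host computes itself */
} work_split;

/* Launch only whole work groups; everything past them is left to the CPU. */
work_split split_work(size_t total, size_t per_item, size_t group_size)
{
    size_t chunk = per_item * group_size;  /* elements covered by one group */
    size_t groups = total / chunk;         /* whole groups only */
    work_split s;
    s.global_size = groups * group_size;
    s.first_cpu = groups * chunk;
    s.cpu_count = total - s.first_cpu;
    return s;
}
```

With 161^3 = 4173281 elements, 4 elements per work item and groups of 256, this launches 1043200 work items and leaves the last 481 elements to the host.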

There is further optimization to be done with the calculation of p.x/y/z, to avoid too many costly division and modulo operations. See the code below.

__kernel void MyKernel(__global float * BigMatrix, __global float * otherArray)
{
    int global_id = get_global_id(0) * 4; //The location of the first pixel in the 1D array
    int pixel_id = global_id;

    int local_id = get_local_id(0); //work item id within the group
    int local_size = get_local_size(0); //size of group

    float result[4]; //result cached for 4 global values
    int i, j;
    float3 p;

    //Finding the initial x,y,z values of the pixel_id.
    p.x = pixel_id % N;
    p.y = ((pixel_id % Nsqr)-p.x)/N;
    p.z = (pixel_id - p.x - p.y*N)/Nsqr;

    //cache the values here. same as above...

    //now this work item can compute the complete result for each of its 4 pixels
    for(j=0;j<4;j++){
        result[j] = 0;

        for(i=0;i<otherSize;i++){
            //same i loop as above...
        }

        //increment the x,y,and z values instead of computing them all from scratch
        //(done after the i loop so the first pixel is computed with the initial p)
        p.x += 1;
        if(p.x >= N){
            p.x = 0;
            p.y += 1;
            if(p.y >= N){
                p.y = 0;
                p.z += 1;
            }
        }
    }

    //write result[0..3] to BigMatrix, same as above...
}
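If you want to convince yourself that stepping x, y and z gives the same coordinates as recomputing them with division and modulo, a quick check in plain C could look like this. The `coord` type and function names are only for illustration, and integers are used here instead of the kernel's float3.

```c
#define N 161

typedef struct { int x, y, z; } coord;

/* Full decomposition of a 1-D index, using division and modulo. */
coord coord_from_index(int pixel_id)
{
    coord c;
    c.x = pixel_id % N;
    c.y = (pixel_id / N) % N;
    c.z = pixel_id / (N * N);
    return c;
}

/* Step to the next pixel using only additions and comparisons. */
coord coord_next(coord c)
{
    c.x += 1;
    if (c.x >= N) {
        c.x = 0;
        c.y += 1;
        if (c.y >= N) {
            c.y = 0;
            c.z += 1;
        }
    }
    return c;
}
```

Starting from `coord_from_index(global_id)` and calling `coord_next` after each pixel keeps the divisions and modulos down to one set per work item.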
