optimization - OpenCLメモリの最適化-最近隣

Question

私はOpenCLで、ポイントの2つの配列を受け取り、各ポイントの最近傍を計算するプログラムを作成しています。

このための2つのプログラムがあります。それらの1つは、4次元の距離を計算し、もう1つは6次元の距離を計算します。それらは以下のとおりです。

4次元：

kernel void BruteForce(
    global  read_only float4* m,
    global  float4* y,
    global write_only ushort* i,
    read_only uint mx)
{
    int index = get_global_id(0);
    float4 curY = y[index];

    float minDist = MAXFLOAT;
    ushort minIdx = -1;
    int x = 0;
    int mmx = mx;
    for(x = 0; x < mmx; x++)
    {
        float dist = fast_distance(curY, m[x]);
        if (dist < minDist)
        {
            minDist = dist;
            minIdx = x;
        }
    }
    i[index] = minIdx;
    y[index] = minDist;
}

6次元：

kernel void BruteForce(
    global  read_only float8* m,
    global  float8* y,
    global write_only ushort* i,
    read_only uint mx)
{
    int index = get_global_id(0);
    float8 curY = y[index];

    float minDist = MAXFLOAT;
    ushort minIdx = -1;
    int x = 0;
    int mmx = mx;
    for(x = 0; x < mmx; x++)
    {
        float8 mx = m[x];
        float d0 = mx.s0 - curY.s0;
        float d1 = mx.s1 - curY.s1;
        float d2 = mx.s2 - curY.s2;
        float d3 = mx.s3 - curY.s3;
        float d4 = mx.s4 - curY.s4;
        float d5 = mx.s5 - curY.s5;

        float dist = sqrt(d0 * d0 + d1 * d1 + d2 * d2 + d3 * d3 + d4 * d4 + d5 * d5);
        if (dist < minDist)
        {
            minDist = dist;
            minIdx = index;
        }
    }
    i[index] = minIdx;
    y[index] = minDist;
}

このプログラムをGPGPU向けに最適化する方法を探しています。ローカルメモリを使用したGPGPUの最適化に関するいくつかの記事（ソースコードが付属しているhttp://www.macresearch.org/opencl_episode6を含む）を読みました。私はそれを適用しようとしました、そしてこのコードを思いつきました：

kernel void BruteForce(
    global  read_only float4* m,
    global  float4* y,
    global write_only ushort* i,
    __local float4 * shared)
{
    int index = get_global_id(0);
    int lsize = get_local_size(0);
    int lid = get_local_id(0);

    float4 curY = y[index];

    float minDist = MAXFLOAT;
    ushort minIdx = 64000;
    int x = 0;
    for(x = 0; x < {0}; x += lsize)
    {
        if((x+lsize) > {0}) 
            lsize = {0} - x;
        if ( (x + lid) < {0})
        {
            shared[lid] = m[x + lid];
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        for (int x1 = 0; x1 < lsize; x1++)
        {
            float dist = distance(curY, shared[x1]);

            if (dist < minDist)
            {
                minDist = dist;
                minIdx = x + x1;
            }
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    i[index] = minIdx;
    y[index] = minDist;
}

'i'出力のガベージ結果を取得しています（たとえば、同じ値が多数あります）。誰かが私を正しい方向に向けることができますか？このコードを改善するのに役立つ回答、または上記の最適化バージョンの問題を見つけるのに役立つ回答をいただければ幸いです。

どうもありがとうございましたCauê

score 1 · Accepted Answer

ここで大幅に高速化する1つの方法は、ローカルデータ構造を使用して、一度にデータのブロック全体を計算することです。また、単一の読み取り/書き込みグローバルベクトル（float4）のみが必要です。同じアイデアは、より小さなブロックを使用して6dバージョンに適用できます。各ワークグループは、処理しているデータのブロックを自由に処理できます。アプリケーションの詳細を知っているので、正確な実装はあなたに任せます。

いくつかの疑似っぽいコード（4d）：

computeBlockSize is the size of the blocks to read from global and crunch.
this value should be a multiple of your work group size. I like 64 as a WG
size; it tends to perform well on most platforms. will be 
allocating 2 * float4 * computeBlockSize + uint * computeBlockSize of shared memory.
(max value for ocl 1.0 ~448, ocl 1.1 ~896)
#define computeBlockSize = 256 

__local float4[computeBlockSize] blockA;
__local float4[computeBlockSize] blockB;
__local uint[computeBlockSize] blockAnearestIndex;

now blockA gets computed against all blockB combinations. this is the job of a single work group.
*important*: only blockA ever gets written to. blockB is stored in local memory, but never changed or copied back to global

steps:
load blockA into local memory with async_work_group_copy
blockA is located at get_group_id(0) * computeBlockSize in the global vector
optional: set all blockA 'w' values to MAXFLOAT
optional: load blockAnearestIndex into local memory with async_work_group_copy if needed


need to compute blockA against itself first, then go into the blockB's
be careful to only write to blockA[j], NOT blockA[k]. j is exclusive to this work item
for(j=get_local_id(0); j<computeBlockSize;j++)
  for(k=0;k<computeBlockSize; k++)
    if(j==k) continue; //no self-comparison
    calculate distance of blockA[j] vs blockA[k]
    store min distance in blockA[j].w
    store global index (= i*computeBlockSize +k) of nearest in blockAnearestIndex[j]
barrier(local_mem_fence)

for (i=0;i<get_num_groups(0);i++)
  if (i==get_group_id(0)) continue;
  load blockB into local memory: async_work_group_copy(...)
  for(j=get_local_id(0); j<computeBlockSize;j++)
    for(k=0;k<computeBlockSize; k++)
      calculate distance of blockA[j] vs blockB[k]
      store min distance in blockA[j].w
      store global index (= i*computeBlockSize +k) of nearest in blockAnearestIndex[j]
  barrier(local_mem_fence)

write blockA and blockAnearestIndex to global memory using two async_work_group_copy

W値のみが変更されている可能性があるため、別のワークグループが（独自のblockAとして）同じブロックを書き込んでいる間、blockBの読み取りに問題はありません。これに問題が発生した場合、または2つの異なる点のベクトルが必要な場合は、上記のように2つのグローバルベクトルを使用できます。1つはA（書き込み可能）、もう1つはB（読み取り専用）です。

このアルゴリズムは、グローバルデータサイズがcomputeBlockSizeの倍数である場合に最適に機能します。エッジを処理するために、2つの解決策が思い浮かびます。上記と同様の方法で、非正方形のエッジブロック用に2番目のカーネルを作成することをお勧めします。新しいカーネルは最初のカーネルの後に実行でき、2番目のpci-e転送を保存できます。または、距離-1を使用して、2つの要素の比較でスキップを示すことができます（つまり、blockA [j] .w==-1またはblockB[k].w == -1の場合、続行します）。ただし、この2番目の解決策では、カーネルの分岐が大幅に増えるため、新しいカーネルを作成することをお勧めします。データポイントのごく一部が実際にはエッジブロックに分類されます。

optimization - OpenCLメモリの最適化-最近隣

1 に答える 1

Related

Reference