memory - CUDA - sync threads -> wait until the first write is complete

Question

I am trying to do the following (simplified). Please read the edit section!

__shared__ int currentPos = 0;
__global__ void myThreadedFunction(float *in, float *out)
{
    // do calculations with in values
    ...

    // now the first thread reaches this point:
    // suspend the other threads up here

    out += currentPos;
    for (int i = 0; i < size; ++i)
    {
        out[i] = calculation[i];
    }
    currentPos += size;

    // now this thread is finished, the other threads can
    // go on with writing
}

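So how can I suspend the threads before they write to the same memory? I cannot write concurrently because I do not know the size of each calculation array (calculation[i] - size).

I know there are __syncthreads and __threadfence, but I don't know how to use them correctly for this problem.

Edit: What I want to do is:

I have 2 threads (for example). Each thread computes a new float * array.

Thread 1 calculated: { 1, 3, 2, 4 }

Thread 2 calculated: { 3, 2, 5, 6, 3, 4 }

The sizes of these arrays are only known after the calculation. I want to write these arrays to float *out.

It does not matter to me whether thread 1 or thread 2 writes first. The output can be either { 1, 3, 2, 4, 3, 2, 5, 6, 3, 4 } or { 3, 2, 5, 6, 3, 4, 1, 3, 2, 4 }.

So, how do I calculate the position in the output array?

I do not want to use a fixed "array size", because the output would then look like this: { 1, 3, 2, 4, ?, ?, 3, 2, 5, 6, 3, 4 }

I think I could use a shared variable POSITION that holds the next write position.

Thread 1 reaches the write point (after calculating its new array). Thread 1 writes its array size (4) into the shared variable POSITION.

While thread 1 is writing its temporary array to the output array, thread 2 reads the variable POSITION, adds its own temporary array size (6) to this variable, and starts writing at the position where thread 1 ends.

If there is a thread 3, it also reads POSITION, adds its array size, and writes to the output where thread 2 ends.

So, does anybody have an idea?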
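The POSITION scheme described above (read the counter, add your own size, write from the old value) corresponds closely to an atomic fetch-and-add. The following is only a minimal sketch of that reading, not something taken from the thread: `g_position`, `appendRuns`, and `runAppend` are hypothetical names, the payload written is a placeholder (each thread writes its own index), error checking is omitted, and the input is assumed to be non-empty.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical global write cursor, visible to every thread in the grid.
__device__ int g_position = 0;

// Each thread reserves `size` slots with a single atomicAdd and then writes
// into the range it reserved. The order of the runs in `out` is not fixed.
__global__ void appendRuns(const int *sizes, int n, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    int size = sizes[tid];
    // atomicAdd returns the value stored *before* the addition, so `start`
    // is the first free slot and no other thread receives the same range.
    int start = atomicAdd(&g_position, size);
    for (int j = 0; j < size; ++j)
        out[start + j] = static_cast<float>(tid);  // placeholder payload
}

// Hypothetical host helper: runs the kernel and returns the packed output.
std::vector<float> runAppend(const std::vector<int> &sizes)
{
    int n = static_cast<int>(sizes.size());  // assumed n >= 1
    int total = 0;
    for (int s : sizes) total += s;

    int zero = 0;
    cudaMemcpyToSymbol(g_position, &zero, sizeof(int));  // reset the cursor

    int *dSizes = nullptr;
    float *dOut = nullptr;
    cudaMalloc(&dSizes, n * sizeof(int));
    cudaMalloc(&dOut, total * sizeof(float));
    cudaMemcpy(dSizes, sizes.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 128;
    appendRuns<<<(n + threads - 1) / threads, threads>>>(dSizes, n, dOut);

    std::vector<float> host(total);
    cudaMemcpy(host.data(), dOut, total * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dSizes);
    cudaFree(dOut);
    return host;
}
```

With sizes { 4, 6 } this produces ten values with no gaps, and either run may come first, which matches the two acceptable outputs listed in the question.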

score 2 · Accepted Answer

I'll explain conceptually how to do the output concurrently, using a shared array to store each thread's index.

#define BLOCK_SIZE 256 // assumed constant; must match the block size used at launch

__global__ void myThreadedFunction(float *in, float *out)
{
    __shared__ int index[BLOCK_SIZE]; // shared array sizes must be compile-time constants
    int tid = threadIdx.x;

    // do calculations with in values
    ...

    index[tid] = size; // assuming size is the size of the array you output

    // serial inclusive prefix sum over index[];
    // you could do a parallel scan here instead for better performance.
    for (int i = 1; i < blockDim.x; ++i) {
        __syncthreads(); // index[i-1] must be final before thread i reads it
        if (tid == i) {
            index[tid] += index[tid-1];
        }
    }
    int startposition = index[tid] - size; // you want to start at the start, not where the index ends

    // do your output for all threads concurrently, where startposition is the first index you output to

}

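So you assign the size you want to output to index[tid], where tid is the thread index threadIdx.x. Then you sum the array upward (toward increasing indices). After the summation, index[tid] holds the offset at which that thread's output ends, counted from the start of thread 0's output, so subtracting size gives the thread's start index. The summation can be done efficiently with a parallel sum (a prefix sum, or scan) instead of the serial loop.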
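As an illustration of replacing the serial loop with a parallel sum, here is a minimal single-block sketch using a Hillis-Steele inclusive scan in shared memory. It is written under assumptions that are not in the answer: `kBlockSize`, `packRuns`, and `runPack` are hypothetical names, the per-thread sizes are passed in as an array instead of being computed in the kernel, the payload is a placeholder (each thread writes its own index), and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlockSize = 128;  // assumed block size; the launch must use the same value

// Single-block sketch: an inclusive scan of the per-thread sizes gives each
// thread the end of its output range; subtracting its own size gives the start.
__global__ void packRuns(const int *sizes, int n, float *out)
{
    __shared__ int index[kBlockSize];
    int tid = threadIdx.x;
    int size = (tid < n) ? sizes[tid] : 0;  // padding threads contribute nothing

    index[tid] = size;
    __syncthreads();

    // Hillis-Steele inclusive scan: log2(kBlockSize) steps instead of kBlockSize.
    for (int offset = 1; offset < kBlockSize; offset *= 2) {
        int addend = (tid >= offset) ? index[tid - offset] : 0;
        __syncthreads();  // all reads of this step finish before any write
        index[tid] += addend;
        __syncthreads();  // all writes finish before the next step reads
    }

    int start = index[tid] - size;
    for (int j = 0; j < size; ++j)
        out[start + j] = static_cast<float>(tid);  // placeholder payload
}

// Hypothetical host helper for up to kBlockSize runs.
std::vector<float> runPack(const std::vector<int> &sizes)
{
    int n = static_cast<int>(sizes.size());  // assumed 1 <= n <= kBlockSize
    int total = 0;
    for (int s : sizes) total += s;

    int *dSizes = nullptr;
    float *dOut = nullptr;
    cudaMalloc(&dSizes, n * sizeof(int));
    cudaMalloc(&dOut, total * sizeof(float));
    cudaMemcpy(dSizes, sizes.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    packRuns<<<1, kBlockSize>>>(dSizes, n, dOut);

    std::vector<float> host(total);
    cudaMemcpy(host.data(), dOut, total * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dSizes);
    cudaFree(dOut);
    return host;
}
```

Unlike an atomic counter, this keeps the runs in thread order, and every `__syncthreads()` sits in control flow that all threads of the block execute. Extending it beyond one block would need a second pass over per-block totals, which is not shown here.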

score 0 · Accepted Answer

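This code works as expected. It reads input[] concurrently. For each input element size, it writes the size values from 1 to size into result, in the order in which the elements are stored in input[].

Note that the writing step can take much longer than doing the same thing on the CPU. Since the size of the data each thread will write is already known, you may want to use a parallel prefix sum to compute each thread's write position first, and then write the data concurrently.

See the documentation on memory fence functions for details on how __threadfence() is used in the code.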

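#include <iostream>
#include <iterator>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/iterator/counting_iterator.h>

volatile __device__ int count = 0; // id of the thread whose turn it is to write
volatile __device__ int pos = 0;   // next free index in result
__global__ void serial(const float* input, const int N, float* result)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;

    //parallel part
    int size = (int) input[id];

    //serial output: threads take turns in id order
    //(this only works if all blocks are resident on the GPU at the same time)
    for (int i = 0; i < N; i++)
    {
        int localcount = count;
        if (localcount >= N)
        {
            break; // every thread has written; nothing left to wait for
        }
        if (localcount == id)
        {
            int localpos = pos;
            for (int j = 0; j < size; j++)
            {
                result[localpos + j] = (float) j + 1;
            }
            pos = localpos + size;
            __threadfence(); // make result[] and pos visible before passing the turn
            count = localcount + 1;
        }
        // wait until some thread advances count
        while (count == localcount)
        {
            __syncthreads();
        };

    }
}

int main()
{
    int N = 6;
    // input = { 1, 2, ..., N }: thread id writes id + 1 values
    thrust::device_vector<float> input(
            thrust::counting_iterator<float>(1),
            thrust::counting_iterator<float>(1) + N);

    // total output length is 1 + 2 + ... + N
    thrust::device_vector<float> result(N * (N + 1) / 2);
    serial<<<2, 3>>>(
            thrust::raw_pointer_cast(&input[0]),
            N,
            thrust::raw_pointer_cast(&result[0]));

    thrust::copy(
            result.begin(), result.end(),
            std::ostream_iterator<float>(std::cout, " "));

    return 0;

}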

Output as expected:

1 1 2 1 2 3 1 2 3 4 1 2 3 4 5 1 2 3 4 5 6
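The parallel prefix sum alternative mentioned in this answer could be sketched with Thrust roughly as follows. This is an illustration under assumptions, not code from the answer: `writeRuns` and `runScanThenWrite` are hypothetical names, the sizes are passed as integers, the input is assumed to be non-empty, and error checking is omitted. `thrust::exclusive_scan` turns the sizes into start offsets, after which all threads write at once without waiting for each other.

```cuda
#include <cuda_runtime.h>
#include <numeric>
#include <thrust/copy.h>
#include <thrust/device_ptr.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <vector>

// Every thread writes its run at the offset computed by the scan.
__global__ void writeRuns(const int *sizes, const int *offsets, int n, float *out)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= n) return;

    int start = offsets[id];
    for (int j = 0; j < sizes[id]; ++j)
        out[start + j] = static_cast<float>(j + 1);  // same payload as the serial kernel
}

// Hypothetical host helper: scan the sizes, then launch one concurrent write.
std::vector<float> runScanThenWrite(const std::vector<int> &hostSizes)
{
    int n = static_cast<int>(hostSizes.size());  // assumed n >= 1
    int total = std::accumulate(hostSizes.begin(), hostSizes.end(), 0);

    thrust::device_vector<int> sizes(hostSizes.begin(), hostSizes.end());
    thrust::device_vector<int> offsets(n);
    // Exclusive scan: offsets[i] = sizes[0] + ... + sizes[i-1], offsets[0] = 0.
    thrust::exclusive_scan(sizes.begin(), sizes.end(), offsets.begin());

    thrust::device_vector<float> result(total);
    int threads = 128;
    writeRuns<<<(n + threads - 1) / threads, threads>>>(
            thrust::raw_pointer_cast(sizes.data()),
            thrust::raw_pointer_cast(offsets.data()),
            n,
            thrust::raw_pointer_cast(result.data()));

    std::vector<float> host(total);
    thrust::copy(result.begin(), result.end(), host.begin());
    return host;
}
```

Called with sizes { 1, 2, 3, 4, 5, 6 }, this should return the same 21 values as the serial kernel, with no spin-waiting and no dependence on how the blocks are scheduled.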
