cuda - thrust - initial device_vector

Question

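Thanks to @Eric Shiyin Kang for answering my question. However, the `__host__` or `__device__` prefix was not the cause of my problem. After some trial and error, I found that the issue is that the member data always stays constant. Here is an example: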

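// UI is not defined in the post; presumably a typedef like this one.
typedef unsigned int UI;

struct OP {
    int N;
    __host__ __device__
    OP(const int n): N(n) {}

    __host__ __device__
    UI operator()(const UI a) {
        int b = a * N;
        N++;  // intended to advance the multiplier on every call
        return b;
    }
};
thrust::transform(A.begin(), A.end(), B.begin(), OP(2));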

In this case, if A is {0, 1, 2, 3, ...}, B comes out as {0, 2, 4, 6, 8, ...}, but B should actually be {0, 3 (1*(2+1)), 8 (2*(3+1)), 15 (3*(4+1)), ...}.

I don't know what causes this. Is it due to the design of Thrust? Can anyone explain?

score 1 · Accepted Answer

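For the updated question: you cannot update the host variable N from device code. In general, it is not safe to update a shared variable repeatedly in a parallel algorithm.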
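To make the point about shared state concrete, here is a host-only sketch in plain C++. It uses the standard library instead of Thrust so it runs without a GPU, and the names `StatefulOp`, `StatelessOp` and `apply_in_order` are invented for this illustration. It shows that a functor which mutates its own member gives results that depend on the order in which elements are visited, while a functor that computes everything from its input does not.

```cpp
#include <cstddef>
#include <vector>

// Stateful functor modeled on OP from the question: the result depends on
// how many times this particular copy has been called before.
struct StatefulOp {
    int n;
    explicit StatefulOp(int start) : n(start) {}
    int operator()(int a) { return a * n++; }
};

// Stateless alternative: for A = {0, 1, 2, ...} the intended multiplier
// N + i equals a + 2, so the result can be computed from the input alone.
struct StatelessOp {
    int operator()(int a) const { return a * (a + 2); }
};

// Apply op to every element of `in`, visiting indices in the given order
// (a stand-in for the unspecified scheduling of a parallel algorithm).
template <typename Op>
std::vector<int> apply_in_order(const std::vector<int>& in,
                                const std::vector<std::size_t>& order,
                                Op op) {
    std::vector<int> out(in.size());
    for (std::size_t idx : order) out[idx] = op(in[idx]);
    return out;
}
```

With A = {0, 1, 2, 3}, `StatelessOp` yields {0, 3, 8, 15} for any visiting order, whereas `StatefulOp(2)` only does so when the elements happen to be visited front to back. On the GPU the situation is likely even less favorable, since each thread would typically work on its own copy of the functor and never see the increments made by the others, which would be consistent with the constant multiplier observed in the question.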

In fact, the fastest way to initialize a device vector is to use fancy iterators at construction time, like this:

using namespace thrust::placeholders;  // provides _1

// v[]={0,2,4,6,8...}
thrust::device_vector<float> v(
        thrust::make_transform_iterator(
                thrust::counting_iterator<float>(0.0),
                _1 * 2.0),
        thrust::make_transform_iterator(
                thrust::counting_iterator<float>(0.0),
                _1 * 2.0) + SIZE);

// u[]={0,3,8,15...}
thrust::device_vector<float> u(
        thrust::make_transform_iterator(
                thrust::counting_iterator<float>(0.0),
                _1 * (_1 + 2.0)),
        thrust::make_transform_iterator(
                thrust::counting_iterator<float>(0.0),
                _1 * (_1 + 2.0)) + SIZE);

This is several times faster than the define-sequence-then-transform approach, because the latter reads and writes the entire device memory of v multiple times.

Note that the code above only works with Thrust 1.6.0+, because lambda-expression (placeholder) functors are used with the fancy iterators. For Thrust 1.5.3 in CUDA 5.0, you have to write the functor explicitly.
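For older Thrust versions without placeholder expressions, an explicit functor might look like the following sketch. `times_plus_two` and `fill_host` are illustrative names, not taken from the original answer. The `#ifndef __CUDACC__` guard is only there so that the same functor also compiles with a plain host compiler, where the block uses `std::transform` in place of the Thrust iterators.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Outside nvcc the CUDA qualifiers do not exist, so define them away.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Explicit replacement for the placeholder expression _1 * (_1 + 2.0).
struct times_plus_two {
    __host__ __device__
    float operator()(float x) const { return x * (x + 2.0f); }
};

// Host-only check of the functor: out[i] = f(i) for i = 0 .. size-1.
inline std::vector<float> fill_host(std::size_t size) {
    std::vector<float> out(size);
    std::iota(out.begin(), out.end(), 0.0f);
    std::transform(out.begin(), out.end(), out.begin(), times_plus_two());
    return out;
}

// With Thrust (assumed usage, mirroring the constructor call in the answer):
//   thrust::device_vector<float> u(
//       thrust::make_transform_iterator(
//           thrust::counting_iterator<float>(0.0f), times_plus_two()),
//       thrust::make_transform_iterator(
//           thrust::counting_iterator<float>(0.0f), times_plus_two()) + SIZE);
```

Calling `fill_host(4)` produces the first four values of the u[] sequence on the host. Because the functor has no mutable state, it should give the same values in any execution order.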

Answer to the original question, which has since been deleted.

You can add both the __host__ and __device__ qualifiers before operator()(), like this:

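// In-place variant: modifies the element through a reference.
struct OP {
    __host__ __device__ void operator()(int &a) {
        a *= 2;
    }
};

and

// Value-returning variant: suitable for thrust::transform.
struct OP {
    __host__ __device__ int operator()(int a) {
        return a * 2;
    }
};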

Otherwise, the compiler will not generate proper device code for the GPU.
