c++ - How to pack bits (efficiently) in CUDA?

Question

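I have an array of bytes where each byte is either 0 or 1. I want to pack these values into bits, so that 8 original bytes occupy 1 target byte, with original byte 0 going into bit 0, byte 1 into bit 1, and so on. So far I have the following in the kernel: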

const uint16_t tid = threadIdx.x;
__shared__ uint8_t packing[cBlockSize];

// ... Computation of the original bytes in packing[tid]
__syncthreads();

if ((tid & 4) == 0)
{
    packing[tid] |= packing[tid | 4] << 4;
}
if ((tid & 6) == 0)
{
    packing[tid] |= packing[tid | 2] << 2;
}
if ((tid & 7) == 0)
{
    pOutput[(tid + blockDim.x*blockIdx.x)>>3] = packing[tid] | (packing[tid | 1] << 1);
}

Is this correct and efficient?
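The layout described above can be pinned down with a small host-side reference, which may be handy for checking a kernel's output. This is only a sketch in plain C++: `packBitsReference` is a hypothetical name that does not appear in the question, and it assumes the input length is a multiple of 8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side reference (not from the question): packs bytes that
// are each 0 or 1 into bits, with source byte i landing in bit (i % 8) of
// output byte (i / 8).
std::vector<std::uint8_t> packBitsReference(const std::vector<std::uint8_t>& flags)
{
    // Assumption: the input length is a multiple of 8.
    assert(flags.size() % 8 == 0);

    std::vector<std::uint8_t> packed(flags.size() / 8, 0);
    for (std::size_t i = 0; i < flags.size(); ++i) {
        // Treat any nonzero byte as a set bit.
        if (flags[i] != 0) {
            packed[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
        }
    }
    return packed;
}
```

Comparing the kernel's output buffer against this function for a few random inputs is one way to answer the correctness half of the question empirically.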

score 8 · Accepted Answer

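The __ballot() warp-voting function comes in quite handy for this. Assuming that you can redefine pOutput to be of type uint32_t, and that your block size is a multiple of the warp size (32):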

unsigned int target = __ballot(packing[tid]);
if (tid % warpSize == 0) {
    pOutput[(tid + blockDim.x*blockIdx.x) / warpSize] = target;
}

Strictly speaking, the if conditional isn't even necessary, since all threads of the warp write the same data to the same address. So a highly optimized version would just be

pOutput[(tid + blockDim.x*blockIdx.x) / warpSize] = __ballot(packing[tid]);
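To make the resulting layout concrete, here is a minimal host-side C++ sketch of what the ballot version should write: one 32-bit word per warp, with lane i's predicate in bit i. The names `packWarpWordsReference` and `byteOfWord` are hypothetical, and the code only models the result on the CPU; it is not the kernel itself. Note also that `__ballot()` is deprecated as of CUDA 9, where the equivalent call would be `__ballot_sync(0xffffffff, packing[tid])`, assuming all 32 lanes of the warp are active.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side model of the ballot-based kernel (names are
// illustrative): one 32-bit word per warp, with lane i's predicate in bit i.
std::vector<std::uint32_t> packWarpWordsReference(const std::vector<std::uint8_t>& flags)
{
    constexpr std::size_t kWarpSize = 32;  // assumption: warp size of 32
    assert(flags.size() % kWarpSize == 0);

    std::vector<std::uint32_t> words(flags.size() / kWarpSize, 0u);
    for (std::size_t i = 0; i < flags.size(); ++i) {
        // A ballot sets a lane's bit when that lane's predicate is nonzero.
        if (flags[i] != 0) {
            words[i / kWarpSize] |= std::uint32_t{1} << (i % kWarpSize);
        }
    }
    return words;
}

// Byte k (0..3) of a word, counted from the least significant end. On a
// little-endian target this is also byte k of the word in memory.
std::uint8_t byteOfWord(std::uint32_t word, unsigned k)
{
    return static_cast<std::uint8_t>((word >> (8 * k)) & 0xFFu);
}
```

On a little-endian target (which NVIDIA GPUs are, as far as I know), byte k of each word in memory equals `byteOfWord(word, k)`, so the uint32_t output should have the same byte layout as the 8-flags-per-byte layout asked for in the question.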

score 1 · Accepted Answer

For two bits per thread, using uint2 *pOutput:

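int lane = tid % warpSize;
uint2 target;
// Ballot lane l tests bit (l & 1) of the 2-bit value held by source lane l / 2.
target.x = __ballot(__shfl(packing[tid], lane / 2)                & ((lane & 1) + 1));
// Same for the upper half of the warp (source lanes warpSize / 2 and up).
target.y = __ballot(__shfl(packing[tid], lane / 2 + warpSize / 2) & ((lane & 1) + 1));
pOutput[(tid + blockDim.x*blockIdx.x) / warpSize] = target;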

You'll have to benchmark whether this is actually faster than your conventional solution.
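For reference, the bit layout that the shuffle-plus-ballot code is meant to produce can be modelled on the host. The following C++ sketch is an assumption-laden illustration, not the kernel: `Word2` stands in for CUDA's `uint2`, `packTwoBitsReference` is a hypothetical name, and a warp size of 32 is assumed. On CUDA 9 and later the corresponding intrinsics would be `__shfl_sync` and `__ballot_sync`, each taking a lane mask as the first argument.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for CUDA's uint2 so the sketch compiles as plain C++.
struct Word2 {
    std::uint32_t x;
    std::uint32_t y;
};

// Hypothetical host-side model: each of the 32 lanes contributes a 2-bit
// value. Lanes 0..15 fill x and lanes 16..31 fill y, with a lane's value
// placed at bit offset 2 * (lane % 16).
Word2 packTwoBitsReference(const std::vector<std::uint8_t>& values)
{
    constexpr std::size_t kWarpSize = 32;  // assumption: warp size of 32
    assert(values.size() == kWarpSize);

    Word2 out{0u, 0u};
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        // Only the two low bits are used, matching the masks 1 and 2.
        const std::uint32_t v = values[lane] & 0x3u;
        if (lane < kWarpSize / 2) {
            out.x |= v << (2 * lane);
        } else {
            out.y |= v << (2 * (lane - kWarpSize / 2));
        }
    }
    return out;
}
```

Checking the kernel's `uint2` output against this model for a handful of inputs would confirm the layout before spending time on the benchmark.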
