cuda - Sharing highly irregular jobs between CUDA threads

Question

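I am working on some tasks related to graph traversal (the Viterbi algorithm). At each time step I have a compacted set of active states. Some job is done in each state, and the results are then propagated through the outgoing arcs to each arc's destination state, so that a new active set of states is built. The problem is that the number of outgoing arcs varies very heavily, from two or three to several thousand, so the compute threads are loaded very unevenly.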

I am trying to share the job through a queue in shared local memory:

int tx = threadIdx.x;

extern __shared__ int smem[];

int *stateSet_s = smem;                     //new active set
int *arcSet_s = &(smem[Q_LEN]);             //local shared queue
float *scores_s = (float*)&(smem[2*Q_LEN]);

__shared__ int arcCnt;
__shared__ int stateCnt;

if ( tx == 0 )
{
   arcCnt = 0;
   stateCnt = 0;
}

__syncthreads();

//load state index from compacted list of state indexes
//(gtx is not defined in this excerpt; presumably it is the global thread index)
int stateId = activeSetIn_g[gtx];

float srcCost = scores_g[ stateId ];
int startId = outputArcStartIds_g[stateId];

int nArcs = outputArcCounts_g[stateId]; //number of outgoing arcs to be propagated (2-3 to thousands)

/////////////////////////////////////////////
/// prepare arc set
/// !!!! that is the troubled code I think !!!!
/// bank conflicts? uncoalesced access?

int myPos = atomicAdd ( &arcCnt, nArcs );

while ( ( nArcs > 0 ) && ( myPos < Q_LEN ) )
{
    scores_s[myPos] = srcCost;
    arcSet_s[myPos] = startId + nArcs - 1;

    myPos++;
    nArcs--;
}

__syncthreads();

//////////////////////////////////////
/// parallel propagate arc set

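//only slots below min(arcCnt, Q_LEN) were written above; the rest of arcSet_s
//is uninitialized shared memory, and arc index 0 is a valid entry
if ( tx < min( arcCnt, Q_LEN ) )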
{
   FstArc arc = arcs_g[ arcSet_s[tx] ];
   float srcCost_ = scores_s[tx];

   DoSomeJob ( &srcCost_ );

   int *dst = &(transitionData_g[arc.dst]);

   int old = atomicMax( dst, FloatToInt ( srcCost_ ) );

   ////////////////////////////////
   //// new active set

   if ( old == ILZERO )
   {
      int pos = atomicAdd ( &stateCnt, 1 );
      stateSet_s[ pos ] = arc.dst;
   }
}

/////////////////////////////////////////////
/// transfer new active set from smem to gmem

__syncthreads();

__shared__ int gPos;

if ( tx == 0 )
{
   gPos = atomicAdd ( activeSetOutSz_g, stateCnt );
}

__syncthreads();

if ( tx < stateCnt )
{
    activeSetOut_g[gPos + tx] = stateSet_s[tx];
}

__syncthreads();

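However, this runs slower than the version without an active set (active set = all states), even though the active set is only 10 to 15% of all states. Register pressure rose considerably and occupancy is low, but I don't think I can do anything about that.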

Is there a more effective way to share the job between threads? I have thought about the warp shuffle operations available on 3.0, but I have to use 2.x devices.
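One candidate that fits the 2.x constraint, sketched here and not measured against the code above: replace the serial per-thread fill loop with a block-wide prefix sum over the arc counts, then let every thread find the owner of its work item by binary search in that prefix array. This needs only shared memory and `__syncthreads()`, no shuffle. The names `expandArcs`, `BLOCK`, `arcOut`, `srcOut` and `runDemo` are hypothetical, and the sketch assumes a single block with at most `BLOCK` states; in a real kernel the two output writes would be replaced by the per-arc work (`DoSomeJob` and the `atomicMax`).

```cuda
#include <vector>
#include <cuda_runtime.h>

#define BLOCK 128  // threads per block (hypothetical choice; launch must use exactly this)

// Thread tx owns state tx, which has counts[tx] arcs starting at starts[tx].
// Work item k (0 <= k < total arcs) is handled by thread k % BLOCK, so
// neighbouring threads always process neighbouring arcs regardless of owner.
__global__ void expandArcs(const int *starts, const int *counts, int nStates,
                           int *arcOut, int *srcOut)
{
    __shared__ int offs[BLOCK + 1];  // offs[s] = number of arcs owned by states < s
    int tx = threadIdx.x;

    offs[tx + 1] = (tx < nStates) ? counts[tx] : 0;
    if (tx == 0) offs[0] = 0;
    __syncthreads();

    // Hillis-Steele inclusive scan over offs[1..BLOCK], shared memory only
    for (int d = 1; d < BLOCK; d <<= 1) {
        int v = (tx >= d) ? offs[tx + 1 - d] : 0;
        __syncthreads();
        offs[tx + 1] += v;
        __syncthreads();
    }

    int total = offs[BLOCK];
    for (int k = tx; k < total; k += BLOCK) {
        // Binary search with invariant offs[lo] <= k < offs[hi]
        int lo = 0, hi = BLOCK;
        while (hi - lo > 1) {
            int mid = (lo + hi) / 2;
            if (offs[mid] <= k) lo = mid; else hi = mid;
        }
        srcOut[k] = lo;                            // owning state (local index)
        arcOut[k] = starts[lo] + (k - offs[lo]);   // global arc index
    }
}

// Host-side check on a small hand-made example with very uneven arc counts.
bool runDemo()
{
    const int nStates = 4;
    int hStarts[nStates] = {0, 3, 4, 1004};
    int hCounts[nStates] = {3, 1, 1000, 2};

    // CPU reference expansion
    std::vector<int> refArc, refSrc;
    for (int s = 0; s < nStates; ++s)
        for (int j = 0; j < hCounts[s]; ++j) {
            refArc.push_back(hStarts[s] + j);
            refSrc.push_back(s);
        }
    int total = (int)refArc.size();

    int *dStarts, *dCounts, *dArc, *dSrc;
    cudaMalloc(&dStarts, sizeof(hStarts));
    cudaMalloc(&dCounts, sizeof(hCounts));
    cudaMalloc(&dArc, total * sizeof(int));
    cudaMalloc(&dSrc, total * sizeof(int));
    cudaMemcpy(dStarts, hStarts, sizeof(hStarts), cudaMemcpyHostToDevice);
    cudaMemcpy(dCounts, hCounts, sizeof(hCounts), cudaMemcpyHostToDevice);

    expandArcs<<<1, BLOCK>>>(dStarts, dCounts, nStates, dArc, dSrc);

    std::vector<int> arc(total), src(total);
    cudaMemcpy(arc.data(), dArc, total * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(src.data(), dSrc, total * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dStarts); cudaFree(dCounts); cudaFree(dArc); cudaFree(dSrc);

    return arc == refArc && src == refSrc;
}
```

Calling `runDemo()` from a `main` compares the GPU expansion with the CPU one. Compared with the queue above, no arcs are dropped when the total exceeds a fixed queue length, because the strided loop simply runs more iterations, and a state with a thousand arcs no longer keeps one thread busy while the others wait at the barrier. Whether this is faster in practice would depend on the extra cost of the scan and the search.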
