memory - CUDA: unintended memory change during a kernel

Question

I am a beginner CUDA programmer.

I am trying to build an application similar to the Nvidia particle system example (many balls in a cube).

I have a kernel launcher function like this:

void Ccuda:: sort_Particles_And_Find_Cell_Start (int  *Cell_Start,          // output
                                                             int *Cell_End,                     // output
                                                             float3 *Sorted_Pos,                // output
                                                             float3 *Sorted_Vel,                //output
                                                             int  *Particle_Cell,                   // input
                                                             int  *Particle_Index,          // input
                                                             float3 *Old_Pos,
                                                             float3 *Old_Vel,
                                                             int   Num_Particles, 
                                                             int Num_Cells)
 {
     int numThreads, numBlocks;

     /*Cell_Start = (int*) cudaAlloc (Num_Cells, sizeof(int));
     Cell_End = (int*) cudaAlloc (Num_Cells, sizeof(int));
     Sorted_Pos = (float3*) cudaAlloc (Num_Particles, sizeof(int));
     Sorted_Vel = (float3*) cudaAlloc (Num_Particles, sizeof(int));*/

    int *h_p_cell = (int *) malloc (Num_Particles * sizeof (int));
    cudaMemcpy (h_p_cell,Particle_Cell, Num_Particles*sizeof(int),cudaMemcpyDeviceToHost);
    free (h_p_cell);

    computeGridSize(Num_Particles, 512, numBlocks, numThreads);

    sort_Particles_And_Find_Cell_StartD<<<numBlocks, numThreads>>>(Cell_Start,Cell_End, Sorted_Pos, Sorted_Vel, Particle_Cell, Particle_Index, Old_Pos, Old_Vel, Num_Particles);

    h_p_cell = (int *) malloc (Num_Particles * sizeof (int));
    cudaMemcpy (h_p_cell,Particle_Cell, Num_Particles*sizeof(int),cudaMemcpyDeviceToHost);
    free (h_p_cell);
 }

And this global kernel function:

__global__ void sort_Particles_And_Find_Cell_StartD(int  *Cell_Start,       // output
                                     int *Cell_End,                     // output
                                     float3 *Sorted_Pos,                // output
                                     float3 *Sorted_Vel,                //output
                                     int  *Particle_Cell,       // input
                                     int  *Particle_Index,          // input
                                     float3 *Old_Pos,
                                     float3 *Old_Vel,
                                     int   Num_Particles)
     {
        int hash;
        extern __shared__ int Shared_Hash[];    // blockSize + 1 elements
        int index = blockIdx.x*blockDim.x + threadIdx.x;

        if (index < Num_Particles)
        { 
             hash = Particle_Cell[index];
             Shared_Hash[threadIdx.x+1] = hash;

            if (index > 0 && threadIdx.x == 0)
            {
                // first thread in block load previous particle hash
                Shared_Hash[0] = Particle_Cell[index-1];
            }
        }

        __syncthreads();

    if (index < Num_Particles)
    {
        // If this particle has a different cell index to the previous
        // particle then it must be the first particle in the cell,
        // so store the index of this particle in the cell.
        // As it isn't the first particle, it must also be the cell end of
        // the previous particle's cell

        if (index == 0 || hash != Shared_Hash[threadIdx.x])     // if its the first thread in the grid or its particle cell index is different from cell index of the previous neighboring thread
        {
            Cell_Start[hash] = index;

            if (index > 0)
                Cell_End[Shared_Hash[threadIdx.x]] = index;
        }

        if (index == Num_Particles - 1)
        {
            Cell_End[hash] = index + 1;
        }

        // Now use the sorted index to reorder the pos and vel data
        int Sorted_Index = Particle_Index[index];
        //float3 pos = FETCH(Old_Pos, Sorted_Index);       // macro does either global read or texture fetch
        //float3 vel = FETCH(Old_Vel, Sorted_Index);       // see particles_kernel.cuh
        float3 pos = Old_Pos[Sorted_Index];
        float3 vel = Old_Vel[Sorted_Index];
        Sorted_Pos[index] = pos;
        Sorted_Vel[index] = vel;
    }
     }

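At run time I get a debug error message, R6010, saying that abort() has been called.

As you can see in the launcher function (the first one), I use int *h_p_cell to inspect the contents of Particle_Cell before and after the kernel runs. Nothing inside the kernel assigns to Particle_Cell, yet its contents appear to have changed. The Particle_Cell memory was set up via cudaMemcpy during the program's init().

I have been trying to solve this problem for several days without success. Can anyone help?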

score 1 · Accepted Answer

Your kernel expects dynamically allocated shared memory:

    extern __shared__ int Shared_Hash[];    // blockSize + 1 elements

However, your kernel launch does not allocate any:

sort_Particles_And_Find_Cell_StartD<<<numBlocks, numThreads>>>(Cell_Start,Cell_End, Sorted_Pos, Sorted_Vel, Particle_Cell, Particle_Index, Old_Pos, Old_Vel, Num_Particles);
                                                           ^
                                                           |
                                                missing shared memory size parameter

You need to specify the amount of shared memory in the launch configuration. You probably want something like this:

sort_Particles_And_Find_Cell_StartD<<<numBlocks, numThreads, ((numThreads+1)*sizeof(int))>>>(Cell_Start,Cell_End, Sorted_Pos, Sorted_Vel, Particle_Cell, Particle_Index, Old_Pos, Old_Vel, Num_Particles);
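To see the third launch parameter in isolation, here is a minimal sketch, not your actual code. The kernel `markBoundaries`, the helper `sharedBytesFor`, and the sample keys are all made up for illustration. It uses the same "blockSize + 1 elements" shared layout as your kernel and sizes it at launch time. Error checking is left out here to keep the focus on the launch configuration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: bytes of dynamic shared memory for blockSize + 1 ints.
static size_t sharedBytesFor(int threadsPerBlock)
{
    return (static_cast<size_t>(threadsPerBlock) + 1) * sizeof(int);
}

// Toy kernel (an assumption, not the asker's kernel):
// flags[i] = 1 where keys[i] differs from keys[i-1], or where i == 0.
__global__ void markBoundaries(const int *keys, int *flags, int n)
{
    // Size is NOT known here; it comes from the third <<<>>> parameter.
    extern __shared__ int sharedKeys[];   // blockDim.x + 1 elements
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int key = 0;

    if (index < n)
    {
        key = keys[index];
        sharedKeys[threadIdx.x + 1] = key;
        if (index > 0 && threadIdx.x == 0)
            sharedKeys[0] = keys[index - 1];   // last key of the previous block
    }
    __syncthreads();

    if (index < n)
        flags[index] = (index == 0 || key != sharedKeys[threadIdx.x]) ? 1 : 0;
}

int main()
{
    const int n = 8;
    const int h_keys[n] = {0, 0, 1, 1, 1, 3, 3, 4};   // already sorted
    int h_flags[n] = {0};
    int *d_keys = nullptr, *d_flags = nullptr;

    cudaMalloc(&d_keys, n * sizeof(int));
    cudaMalloc(&d_flags, n * sizeof(int));
    cudaMemcpy(d_keys, h_keys, n * sizeof(int), cudaMemcpyHostToDevice);

    const int numThreads = 4;
    const int numBlocks = (n + numThreads - 1) / numThreads;

    // Third launch parameter: dynamic shared memory size in bytes, per block.
    markBoundaries<<<numBlocks, numThreads, sharedBytesFor(numThreads)>>>(d_keys, d_flags, n);

    cudaMemcpy(h_flags, d_flags, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        printf("%d ", h_flags[i]);
    printf("\n");

    cudaFree(d_keys);
    cudaFree(d_flags);
    return 0;
}
```

If the third parameter is left out, it defaults to 0 bytes, so every access to `sharedKeys` is out of range.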

Because of this error, the kernel will abort abnormally when it tries to access shared memory. You should also do CUDA error checking on all CUDA API calls and kernel calls. I see no evidence of that in your code.
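One common way to do that error checking is a small wrapper macro, sketched below. The macro name `CUDA_CHECK` and the kernel `fillKernel` are placeholders I made up. The runtime calls (`cudaGetLastError`, `cudaDeviceSynchronize`, `cudaGetErrorString`) are the standard ones. Treat it as a starting point rather than the one right way.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical macro: abort with file/line if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Placeholder kernel used only to have something to launch.
__global__ void fillKernel(int *data, int value, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = value;
}

int main()
{
    const int n = 1024;
    int *d_data = nullptr;
    CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(int)));

    fillKernel<<<(n + 255) / 256, 256>>>(d_data, 7, n);
    CUDA_CHECK(cudaGetLastError());        // catches launch/configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());   // catches errors raised while the kernel runs

    int h_first = 0;
    CUDA_CHECK(cudaMemcpy(&h_first, d_data, sizeof(int), cudaMemcpyDeviceToHost));
    printf("first element: %d\n", h_first);

    CUDA_CHECK(cudaFree(d_data));
    return 0;
}
```

Kernel launches are asynchronous and return no status themselves, which is why the launch is followed by both `cudaGetLastError()` and a synchronizing call.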

Once you have sorted out all the API errors, run your code with cuda-memcheck. The unexpected writes to Particle_Cell may be caused by out-of-bounds accesses from the kernel, which cuda-memcheck will make evident.
