c++ - Why does this CUDA kernel produce different results from the original code?

Question

I ported this code:

    if(_layersCount > 1)
    {
        for(int i=_layersCount-2;i>=0;i--)
        {
            for(int j=0;j<_neuronsPerLayerCount[i];j++) // cuda kernel
            {
                localGradients[indexByLayerAndNeuron(i, j)] = 0;

                for(int k=0;k<_neuronsPerLayerCount[i+1];k++)
                {
                    localGradients[indexByLayerAndNeuron(i, j)] += _neuronsInputsWeights[indexByLayerNeuronAndInput(i+1, k, j)]
                                                                    * localGradients[indexByLayerAndNeuron(i+1, k)];
                }

                localGradients[indexByLayerAndNeuron(i, j)] *= derivatives[indexByLayerAndNeuron(i, j)];
            }
        }
    }

to CUDA:

    if(_layersCount > 1)
    {
        for(int i=_layersCount-2;i>=0;i--)
        {
            // calculateLocalGradientsForAnotherLayers
            blocksCount = floor((double) _neuronsPerLayerCount[i] / threads.x) + 1;
            blocks = dim3(blocksCount, 1);

            calculateLocalGradientsForAnotherLayers <<<blocks, threads>>> (deviceLocalGradients, _neuronsInputsWeights, deviceDerivatives, _neuronsPerLayerCount[i], _neuronsInPreviousLayers[i], _neuronsInPreviousLayers[i+1], _neuronsPerLayerCount[i+1], _inputsInPreviousLayers[i], _inputsInCurrentLayer[i]);
        }
    }
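
A side note on the grid size: `floor(n / threads.x) + 1` launches one spare block whenever the neuron count is an exact multiple of the block size. That is harmless here because the kernel guards with `idx < neuronsCount`, but integer ceiling division is the more common idiom. A minimal host-side sketch (the helper name `blocksFor` is hypothetical and not part of the original code):

```cpp
#include <cassert>

// Number of blocks needed to cover `count` items with `threadsPerBlock` threads each.
// Integer ceiling division: no spare block is launched when count divides evenly.
inline int blocksFor(int count, int threadsPerBlock)
{
    return (count + threadsPerBlock - 1) / threadsPerBlock;
}
```

With 256 threads per block, 256 neurons need one block under this formula, while the `floor(...) + 1` form asks for two.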

The calculateLocalGradientsForAnotherLayers kernel:

    __global__ void calculateLocalGradientsForAnotherLayers(double * localGradients, double * neuronsInputsWeights, double * derivatives, int neuronsCount, int neuronsInPreviousLayers, int neuronsInPreviousLayersWithCurrent, int neuronsInNextLayer, int inputsInPreviousLayers, int inputsInCurrentLayer)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;

        if(idx < neuronsCount)
        {
            int neuron = neuronsInPreviousLayers + idx;

            localGradients[neuron] = 0;

            // this to Kernel, then reduce localGradients.
            for(int k=0;k<neuronsInNextLayer;k++)
            {
                localGradients[neuron] += neuronsInputsWeights[inputsInPreviousLayers + k*inputsInCurrentLayer + idx]
                                                                * localGradients[neuronsInPreviousLayersWithCurrent + k];
            }

            localGradients[neuron] *= derivatives[neuron];
        }
    }

But the results differ starting from the second decimal place. Why is the error so large? Apart from this one, all my kernels work correctly.

My GPU is an NV GF555M. It supports double precision.

score 1 · Accepted Answer

In the body of your kernel, accesses to the localGradients array need some kind of inter-block synchronization.

    for(int k=0;k<neuronsInNextLayer;k++)
    {
        localGradients[neuron] += neuronsInputsWeights[inputsInPreviousLayers + k*inputsInCurrentLayer + idx]
                                                        * localGradients[neuronsInPreviousLayersWithCurrent + k];
    }

Concurrent read/write accesses to localGradients can corrupt the actual values of its elements. Because the reads and writes are not synchronized, you may see random results.
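
One way to cut down on read-modify-write traffic to the shared array is to accumulate the sum in a local variable and store it once. The sketch below is a plain C++ stand-in for the work of a single thread, not the original kernel; the function name `gradientForNeuron` is hypothetical, and the parameter names simply mirror the kernel's. It assumes each thread writes only its own `neuron` element and only reads elements of the next layer.

```cpp
#include <cassert>
#include <vector>

// CPU stand-in for the work of a single thread `idx`; parameter names mirror the kernel.
// The sum lives in a local variable and is stored to localGradients exactly once.
inline void gradientForNeuron(std::vector<double>& localGradients,
                              const std::vector<double>& neuronsInputsWeights,
                              const std::vector<double>& derivatives,
                              int idx,
                              int neuronsInPreviousLayers,
                              int neuronsInPreviousLayersWithCurrent,
                              int neuronsInNextLayer,
                              int inputsInPreviousLayers,
                              int inputsInCurrentLayer)
{
    const int neuron = neuronsInPreviousLayers + idx;
    double sum = 0.0;
    for (int k = 0; k < neuronsInNextLayer; k++)
    {
        // Same indexing as the kernel: weight from this neuron to next-layer neuron k.
        sum += neuronsInputsWeights[inputsInPreviousLayers + k * inputsInCurrentLayer + idx]
             * localGradients[neuronsInPreviousLayersWithCurrent + k];
    }
    // Single store per element instead of one store per loop iteration.
    localGradients[neuron] = sum * derivatives[neuron];
}
```

In a kernel the same pattern would keep `sum` in a register, so each output element is written once rather than on every iteration.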

score 1 · Accepted Answer

I found the problem. Instead of the line:

    calculateLocalGradientsForAnotherLayers <<<blocks, threads>>> (deviceLocalGradients, _neuronsInputsWeights, deviceDerivatives, _neuronsPerLayerCount[i], _neuronsInPreviousLayers[i], _neuronsInPreviousLayers[i+1], _neuronsPerLayerCount[i+1], _inputsInPreviousLayers[i], _inputsInCurrentLayer[i]);

I should have written:

    calculateLocalGradientsForAnotherLayers <<<blocks, threads>>> (deviceLocalGradients, _neuronsInputsWeights, deviceDerivatives, _neuronsPerLayerCount[i], _neuronsInPreviousLayers[i], _neuronsInPreviousLayers[i+1], _neuronsPerLayerCount[i+1], _inputsInPreviousLayers[i+1], _inputsInCurrentLayer[i+1]);
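
The reason the `[i+1]` entries are needed is that the loop reads the weights of layer `i+1`, so both the base offset and the per-neuron stride must belong to that layer. The sketch below illustrates this under an assumed flat layout, since the original `indexByLayerNeuronAndInput` helper is not shown: weights are stored layer by layer and, within a layer, neuron by neuron. The `WeightLayout` struct and its constructor arguments are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flat weight layout (an assumption; the original helper is not shown):
// weights are stored layer by layer, and within a layer neuron by neuron.
struct WeightLayout
{
    std::vector<int> inputsInPreviousLayers; // number of weights stored before layer l
    std::vector<int> inputsInCurrentLayer;   // inputs per neuron in layer l

    // neuronsPerLayer[l] = neurons in layer l; networkInputs = inputs per neuron in layer 0.
    WeightLayout(const std::vector<int>& neuronsPerLayer, int networkInputs)
    {
        int offset = 0;
        for (std::size_t l = 0; l < neuronsPerLayer.size(); l++)
        {
            const int inputs = (l == 0) ? networkInputs : neuronsPerLayer[l - 1];
            inputsInPreviousLayers.push_back(offset);
            inputsInCurrentLayer.push_back(inputs);
            offset += inputs * neuronsPerLayer[l];
        }
    }

    // Assumed equivalent of indexByLayerNeuronAndInput(layer, neuron, input).
    int index(int layer, int neuron, int input) const
    {
        return inputsInPreviousLayers[layer] + neuron * inputsInCurrentLayer[layer] + input;
    }
};
```

For layers of 2, 3 and 2 neurons with 4 network inputs, the weight for (layer 1, neuron 1, input 1) sits at index 11. Using the layer 1 offsets in the kernel's formula gives 11 as well, while the layer 0 offsets give 5, which points at a different weight.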
