c++ - CUDA shared memory programming is not working

Question

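All:

I am learning how shared memory can speed up GPU programs. I am using the code below to compute, for each element, its squared value plus the squared value of the average of its left and right neighbors. The code runs, but the result is not what I expect.

The first 10 results printed are 0,1,2,3,4,5,6,7,8,9, whereas I expect 25,2,8,18,32,50,72,98,128,162.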
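As a cross-check on the expected values, here is a minimal CPU sketch of the formula in C++. The function name `neighbor_reference` is hypothetical and not part of the posted code; the wrap-around at the ends is an assumption that mirrors the indexing used in the kernel. Under that indexing with N = 1024, elements 1 through 9 come out as 2, 8, 18, 32, 50, 72, 98, 128, 162, but element 0 averages its wrapped neighbors 1023 and 1 and evaluates to 262144 rather than 25.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference (not from the original post): for each element,
// the square of the average of its two neighbors plus the square of the
// element itself. The ends wrap around, as in the kernel's index expressions.
std::vector<float> neighbor_reference(const std::vector<float>& in) {
    assert(!in.empty());
    const std::size_t n = in.size();
    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        const float left  = in[i > 0 ? i - 1 : n - 1];   // wraps to the last element
        const float right = in[i < n - 1 ? i + 1 : 0];   // wraps to the first element
        const float avg   = (left + right) * 0.5f;
        out[i] = avg * avg + in[i] * in[i];
    }
    return out;
}
```

Calling `neighbor_reference` on a vector holding 0, 1, ..., 1023 gives values that the GPU output can be compared against element by element.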

With reference to here, the code is as follows.

Could you please tell me which part is wrong? Any help is greatly appreciated.

#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <cuda.h>

const int N = 1024;

__global__ void compute_it(float *data)
{
    int tid = threadIdx.x;
    __shared__ float myblock[N];
    float tmp;

    // load the thread's data element into shared memory
    myblock[tid] = data[tid];

    // ensure that all threads have loaded their values into
    // shared memory; otherwise, one thread might be computing
    // on uninitialized data.
    __syncthreads();

    // compute the average of this thread's left and right neighbors
    tmp = (myblock[tid > 0 ? tid - 1 : (N - 1)] + myblock[tid < (N - 1) ? tid + 1 : 0]) * 0.5f;
    // square the previous result and add my value, squared
    tmp = tmp * tmp + myblock[tid] * myblock[tid];

    // write the result back to global memory
    data[tid] = myblock[tid];
    __syncthreads();
}

int main()
{
    char key;

    float *a;
    float *dev_a;

    a = (float*)malloc(N * sizeof(float));
    cudaMalloc((void**)&dev_a, N * sizeof(float));

    for (int i = 0; i < N; i++) {
        a[i] = i;
    }

    cudaMemcpy(dev_a, a, N * sizeof(float), cudaMemcpyHostToDevice);

    compute_it<<<N,1>>>(dev_a);

    cudaMemcpy(a, dev_a, N * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < 10; i++) {
        std::cout << a[i] << ",";
    }

    std::cin >> key;

    free(a);
    free(dev_a);
}

score 1 · Accepted Answer

The launch creates N blocks with only one thread each, so threadIdx.x, and therefore tid, is always 0.

Try launching the kernel this way: compute_it<<<1,N>>>(dev_a);

instead of compute_it<<<N,1>>>(dev_a);
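To see the effect of the two launch configurations without a GPU, here is a rough CPU emulation in C++. It is a sketch under stated assumptions, not how CUDA actually schedules threads: `emulate_compute_it` and its `write_tmp` flag are hypothetical names, each emulated block gets its own zero-filled "shared" array (real shared memory starts uninitialized), and the two loops stand in for the code before and after the first `__syncthreads()`. Tracing the posted kernel this way suggests the launch change alone may not be enough. The kernel stores `myblock[tid]` back to `data[tid]` instead of `tmp`, so the array would still read 0,1,2,...; changing that line to `data[tid] = tmp;` looks necessary as well. Separately, `dev_a` comes from `cudaMalloc`, so it should presumably be released with `cudaFree(dev_a)` rather than `free(dev_a)`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU emulation of a 1-D launch of compute_it: grid_dim blocks of
// block_dim threads. write_tmp == false mirrors the posted write-back
// (data[tid] = myblock[tid]); write_tmp == true stores the computed value.
std::vector<float> emulate_compute_it(std::vector<float> data,
                                      std::size_t grid_dim,
                                      std::size_t block_dim,
                                      bool write_tmp) {
    const std::size_t n = data.size();
    assert(block_dim >= 1 && block_dim <= n);
    for (std::size_t block = 0; block < grid_dim; ++block) {
        // One "shared" array per block. Assumption: zero-filled here, whereas
        // real shared memory holds arbitrary values until written.
        std::vector<float> myblock(n, 0.0f);

        // Phase 1: every thread loads using threadIdx.x only (the block index is ignored).
        for (std::size_t tid = 0; tid < block_dim; ++tid) {
            myblock[tid] = data[tid];
        }

        // Phase 2 (after the barrier): neighbor average, square, write back.
        for (std::size_t tid = 0; tid < block_dim; ++tid) {
            const float left  = myblock[tid > 0 ? tid - 1 : n - 1];
            const float right = myblock[tid < n - 1 ? tid + 1 : 0];
            const float avg   = (left + right) * 0.5f;
            const float tmp   = avg * avg + myblock[tid] * myblock[tid];
            data[tid] = write_tmp ? tmp : myblock[tid];
        }
    }
    return data;
}
```

With the input 0, 1, ..., 1023, `emulate_compute_it(a, 1024, 1, false)` corresponds to the posted `<<<N,1>>>` launch and leaves the array unchanged, `emulate_compute_it(a, 1, 1024, false)` corresponds to `<<<1,N>>>` with the original write-back and is also unchanged, and `emulate_compute_it(a, 1, 1024, true)` applies both changes and produces the computed values.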
