I wanted to compare the speed of a single Intel CPU core with the speed of a single nVidia GPU core (i.e. a single CUDA core, a single thread). I implemented the following naive 2D image convolution algorithm:
void convolution_cpu(uint8_t* res, uint8_t* img, uint32_t img_width, uint32_t img_height, uint8_t* krl, uint32_t krl_width, uint32_t krl_height)
{
    int32_t center_x = krl_width / 2;
    int32_t center_y = krl_height / 2;
    int32_t sum;
    int32_t fkx, fky;
    int32_t xx, yy;

    // normalization coefficient: 1 / (sum of kernel weights)
    float krl_sum = 0;
    for (uint32_t i = 0; i < krl_width*krl_height; ++i)
        krl_sum += krl[i];
    float nc = 1.0f / krl_sum;

    for (int32_t y = 0; y < (int32_t)img_height; ++y)
    {
        for (int32_t x = 0; x < (int32_t)img_width; ++x)
        {
            sum = 0;

            for (int32_t ky = 0; ky < (int32_t)krl_height; ++ky)
            {
                fky = krl_height - 1 - ky; // flipped kernel row

                for (int32_t kx = 0; kx < (int32_t)krl_width; ++kx)
                {
                    fkx = krl_width - 1 - kx; // flipped kernel column

                    yy = y + (ky - center_y);
                    xx = x + (kx - center_x);

                    // skip taps that fall outside the image
                    if (yy >= 0 && yy < (int32_t)img_height && xx >= 0 && xx < (int32_t)img_width)
                    {
                        sum += img[yy*img_width + xx] * krl[fky*krl_width + fkx];
                    }
                }
            }

            res[y*img_width + x] = sum * nc;
        }
    }
}
The algorithm is the same for both the CPU and the GPU. I also made a second GPU version that is almost identical to the one above; the only difference is that it copies the img and krl arrays into shared memory before using them.
I used 2 images of dimensions 52x52 each and I got the following performance:
- CPU: 10ms
- GPU: 1338ms
- GPU (smem): 1165ms
The CPU is an Intel Xeon X5650 2.67GHz and the GPU is an nVidia Tesla C2070.
Why do I get such a performance difference? It looks like a single CUDA core is over 100 times slower for this particular code! Could someone explain why? The reasons I can think of are:
- the CPU's higher clock frequency
- the CPU does branch prediction
- the CPU may have better caching mechanisms
What do you think is the major issue causing this huge performance difference?
Note that I want to compare the speed of a single CPU thread with a single GPU thread. I am not trying to evaluate the GPU's overall compute performance. I am aware that this is not the right way to do convolution on the GPU.