I wanted to compare the speed of a single Intel CPU core with the speed of a single nVidia GPU core (i.e. a single CUDA core, a single thread). I implemented the following naive 2D image convolution algorithm:
void convolution_cpu(uint8_t* res, uint8_t* img, uint32_t img_width, uint32_t img_height, uint8_t* krl, uint32_t krl_width, uint32_t krl_height)
{
    int32_t center_x = krl_width / 2;
    int32_t center_y = krl_height / 2;
    int32_t sum;
    int32_t fkx, fky;
    int32_t xx, yy;

    // normalization coefficient: 1 / (sum of kernel weights)
    float krl_sum = 0;
    for (uint32_t i = 0; i < krl_width*krl_height; ++i)
        krl_sum += krl[i];
    float nc = 1.0f / krl_sum;

    for (int32_t y = 0; y < (int32_t)img_height; ++y)
    {
        for (int32_t x = 0; x < (int32_t)img_width; ++x)
        {
            sum = 0;

            for (int32_t ky = 0; ky < (int32_t)krl_height; ++ky)
            {
                fky = krl_height - 1 - ky; // flipped kernel row

                for (int32_t kx = 0; kx < (int32_t)krl_width; ++kx)
                {
                    fkx = krl_width - 1 - kx; // flipped kernel column

                    yy = y + (ky - center_y);
                    xx = x + (kx - center_x);

                    // skip taps that fall outside the image
                    if (yy >= 0 && yy < (int32_t)img_height && xx >= 0 && xx < (int32_t)img_width)
                    {
                        sum += img[yy*img_width + xx] * krl[fky*krl_width + fkx];
                    }
                }
            }

            res[y*img_width + x] = sum * nc;
        }
    }
}
The algorithm is the same for both the CPU and the GPU. I also made a second GPU version that is almost identical to the one above; the only difference is that it copies the img and krl arrays into shared memory before using them.
I used 2 images of dimensions 52x52 each and I got the following performance:
- CPU: 10ms
- GPU: 1338ms
- GPU (smem): 1165ms
The CPU is an Intel Xeon X5650 2.67GHz and the GPU is an nVidia Tesla C2070.
Why do I get such a performance difference? It looks like a single CUDA core is over 100 times slower for this particular code! Could someone explain why? The reasons I can think of are:
- the CPU's higher clock frequency
- the CPU does branch prediction
- the CPU may have better caching mechanisms
What do you think is the major issue causing this huge performance difference?
Note that I want to compare the speed of a single CPU thread with a single GPU thread. I am not trying to evaluate the GPU's overall compute performance. I am aware that this is not the right way to do convolution on the GPU.