c - Strange multithreaded performance

Question

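I'm trying to get to the bottom of some rather disappointing performance results in an HPC application. I wrote the following benchmark in Visual Studio 2010; it distills the essence of the application (lots of independent operations with high arithmetic intensity):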

#include "stdafx.h"
#include <math.h>
#include <time.h>
#include <Windows.h>
#include <stdio.h>
#include <memory.h>
#include <process.h>

void makework(void *jnk) {
    double tmp = 0;
    for(int j=0; j<10000; j++) {
        for(int i=0; i<1000000; i++) {
            tmp = tmp+(double)i*(double)i;
        }
    }
    *((double *)jnk) = tmp;
    _endthread();
}

void spawnthreads(int num) {
    HANDLE *hThreads = (HANDLE *)malloc(num*sizeof(HANDLE));
    double *junk = (double *)malloc(num*sizeof(double));
    printf("Starting %i threads... ", num);
    for(int i=0; i<num; i++) {
        hThreads[i] = (HANDLE)_beginthread(makework, 0, &junk[i]);
    }
    int start = GetTickCount();
    WaitForMultipleObjects(num, hThreads, TRUE, INFINITE);
    int end = GetTickCount();
    FILE *fp = fopen("makework.log", "a+");
    fprintf(fp, "%i,%.3f\n", num, (double)(end-start)/1000.0);
    fclose(fp);
    printf("Elapsed time: %.3f seconds\n", (double)(end-start)/1000.0);
    free(hThreads);
    free(junk);
}

int _tmain(int argc, _TCHAR* argv[])
{
    for(int i=1; i<=20; i++) {
        spawnthreads(i);
    }
    return 0;
}

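Since every thread performs exactly the same operations, the run should (ideally) take about 11 seconds until the physical cores are filled, and then double once threads start landing on the logical hyperthreaded cores. My loop variables and results fit in registers, so there should be no cache issues.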
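
A rough sketch of that expectation, assuming a perfect scheduler and assuming two hyperthreads on one core each run at half speed (both are simplifications, and `expected_seconds` is a hypothetical helper, not part of the benchmark):

```c
#include <assert.h>

/* Idealized elapsed time for `threads` identical CPU-bound threads.
 * Assumptions (not measurements): the scheduler puts one thread on each
 * physical core until all are busy, and two hyperthreads sharing a core
 * each run at half speed. Beyond the logical core count, threads are
 * assumed to be time-sliced evenly. */
double expected_seconds(int threads, int physical_cores, double single_thread_seconds)
{
    assert(threads > 0 && physical_cores > 0);
    int logical_cores = 2 * physical_cores;
    if (threads <= physical_cores)
        return single_thread_seconds;        /* every thread has its own core */
    if (threads <= logical_cores)
        return 2.0 * single_thread_seconds;  /* at least one core is shared */
    /* oversubscribed: total work divided by total throughput (rough) */
    return single_thread_seconds * (double)threads / (double)physical_cores;
}
```

With 12 physical cores and an 11-second single-thread run, this model would give about 11 seconds for 1 to 12 threads and about 22 seconds for 13 to 20.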

Here are the results of experiments on two testbeds running Windows Server 2008.

Machine 1: Dual Xeon X5690 @ 3.47 GHz -- 12 physical cores, 24 logical cores, Westmere architecture

Starting 1 threads... Elapsed time: 11.575 seconds
Starting 2 threads... Elapsed time: 11.575 seconds
Starting 3 threads... Elapsed time: 11.591 seconds
Starting 4 threads... Elapsed time: 11.684 seconds
Starting 5 threads... Elapsed time: 11.825 seconds
Starting 6 threads... Elapsed time: 12.324 seconds
Starting 7 threads... Elapsed time: 14.992 seconds
Starting 8 threads... Elapsed time: 15.803 seconds
Starting 9 threads... Elapsed time: 16.520 seconds
Starting 10 threads... Elapsed time: 17.098 seconds
Starting 11 threads... Elapsed time: 17.472 seconds
Starting 12 threads... Elapsed time: 17.519 seconds
Starting 13 threads... Elapsed time: 17.395 seconds
Starting 14 threads... Elapsed time: 17.176 seconds
Starting 15 threads... Elapsed time: 16.973 seconds
Starting 16 threads... Elapsed time: 17.144 seconds
Starting 17 threads... Elapsed time: 17.129 seconds
Starting 18 threads... Elapsed time: 17.581 seconds
Starting 19 threads... Elapsed time: 17.769 seconds
Starting 20 threads... Elapsed time: 18.440 seconds

Machine 2: Dual Xeon E5-2690 @ 2.90 GHz -- 16 physical cores, 32 logical cores, Sandy Bridge architecture

Starting 1 threads... Elapsed time: 10.249 seconds
Starting 2 threads... Elapsed time: 10.562 seconds
Starting 3 threads... Elapsed time: 10.998 seconds
Starting 4 threads... Elapsed time: 11.232 seconds
Starting 5 threads... Elapsed time: 11.497 seconds
Starting 6 threads... Elapsed time: 11.653 seconds
Starting 7 threads... Elapsed time: 11.700 seconds
Starting 8 threads... Elapsed time: 11.888 seconds
Starting 9 threads... Elapsed time: 12.246 seconds
Starting 10 threads... Elapsed time: 12.605 seconds
Starting 11 threads... Elapsed time: 13.026 seconds
Starting 12 threads... Elapsed time: 13.041 seconds
Starting 13 threads... Elapsed time: 13.182 seconds
Starting 14 threads... Elapsed time: 12.885 seconds
Starting 15 threads... Elapsed time: 13.416 seconds
Starting 16 threads... Elapsed time: 13.011 seconds
Starting 17 threads... Elapsed time: 12.949 seconds
Starting 18 threads... Elapsed time: 13.011 seconds
Starting 19 threads... Elapsed time: 13.166 seconds
Starting 20 threads... Elapsed time: 13.182 seconds

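The aspects I find puzzling are:

Why is the elapsed time on the Westmere machine constant up to about 6 cores, then jumps suddenly, and then stays essentially constant beyond 10 threads? Does Windows pack all the threads onto one processor before moving to the second?
Why does the elapsed time on the Sandy Bridge machine increase essentially linearly with the thread count up to about 12? Given the number of cores, twelve doesn't look like a meaningful number.

Any thoughts or suggestions on processor counters to measure, or on ways to improve my benchmark, are very welcome. Is this an architecture issue or a Windows issue?

Edit:

As pointed out below, the compiler was doing some odd things, so I wrote my own assembly code that does the same thing as above but keeps all the FP operations on the FP stack to avoid any memory accesses: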

void makework(void *jnk) {
    register int i, j;
//  register double tmp = 0;
    __asm {
        fldz  // this holds the result on the stack
    }
    for(j=0; j<10000; j++) {
        __asm {
            fldz // push i onto the stack: stack = 0, res
        }
        for(i=0; i<1000000; i++) {
            // tmp += (double)i * (double)i;
            __asm {
                fld st(0)  // stack: i, i, res
                fld st(0)  // stack: i, i, i, res
                fmul       // stack: i*i, i, res
                faddp st(2), st(0) // stack: i, res+i*i
                fld1       // stack: 1, i, res+i*i
                fadd      // stack: i+1, res+i*i
            }
        }
        __asm {
            fstp st(0)   // pop i off the stack leaving only res in st(0)
        }
    }
    __asm {
        mov eax, dword ptr [jnk]
        fstp qword ptr [eax]
    }
//  *((double *)jnk) = tmp;
    _endthread();
}

This assembles to:

013E1002  in          al,dx  
013E1003  fldz  
013E1005  mov         ecx,2710h  
013E100A  lea         ebx,[ebx]  
013E1010  fldz  
013E1012  mov         eax,0F4240h  
013E1017  fld         st(0)  
013E1019  fld         st(0)  
013E101B  fmulp       st(1),st  
013E101D  faddp       st(2),st  
013E101F  fld1  
013E1021  faddp       st(1),st  
013E1023  dec         eax  
013E1024  jne         makework+17h (13E1017h)  
013E1026  fstp        st(0)  
013E1028  dec         ecx  
013E1029  jne         makework+10h (13E1010h)  
013E102B  mov         eax,dword ptr [jnk]  
013E102E  fstp        qword ptr [eax]  
013E1030  pop         ebp  
013E1031  jmp         dword ptr [__imp___endthread (13E20C0h)]

The results on Machine 1 above are:

Starting 1 threads... Elapsed time: 12.589 seconds
Starting 2 threads... Elapsed time: 12.574 seconds
Starting 3 threads... Elapsed time: 12.652 seconds
Starting 4 threads... Elapsed time: 12.682 seconds
Starting 5 threads... Elapsed time: 13.011 seconds
Starting 6 threads... Elapsed time: 13.790 seconds
Starting 7 threads... Elapsed time: 16.411 seconds
Starting 8 threads... Elapsed time: 18.003 seconds
Starting 9 threads... Elapsed time: 19.220 seconds
Starting 10 threads... Elapsed time: 20.124 seconds
Starting 11 threads... Elapsed time: 20.764 seconds
Starting 12 threads... Elapsed time: 20.935 seconds
Starting 13 threads... Elapsed time: 20.748 seconds
Starting 14 threads... Elapsed time: 20.717 seconds
Starting 15 threads... Elapsed time: 20.608 seconds
Starting 16 threads... Elapsed time: 20.685 seconds
Starting 17 threads... Elapsed time: 21.107 seconds
Starting 18 threads... Elapsed time: 21.451 seconds
Starting 19 threads... Elapsed time: 22.043 seconds
Starting 20 threads... Elapsed time: 22.745 seconds

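So it is about 9% slower with a single thread (the difference between an inc eax and an fld1 plus faddp, perhaps?), and once all the physical cores are filled it is almost 2x slower (which is what I would expect from hyperthreading). But the puzzling aspect remains: the performance degradation starts at only 6 threads...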
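
For reference, the percentages can be checked directly from the timings listed in this post. This is just arithmetic on the posted numbers; `percent_slower` is a hypothetical helper name:

```c
#include <assert.h>

/* Percentage by which time `t` exceeds `baseline`. */
double percent_slower(double baseline, double t)
{
    assert(baseline > 0.0);
    return (t / baseline - 1.0) * 100.0;
}
```

Plugging in the 1-thread times (11.575 s compiled, 12.589 s hand-written x87) gives roughly 8.8%. Comparing 12 threads (20.935 s) with 1 thread (12.589 s) in the assembly version gives roughly 66%, and 20 threads (22.745 s) gives roughly 81%, so the slowdown approaches but does not reach a full 2x.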

score 2 · Accepted Answer

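Now, to be completely lame and answer my own question: it seems to be the scheduler, as @us2012 suggested. I hard-coded the affinity masks so that the physical cores are filled first, and only then the hyperthreaded cores: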

void spawnthreads(int num) {
    ULONG_PTR masks[] = {  // for my system; YMMV
        0x1, 0x4, 0x10, 0x40, 0x100, 0x400, 0x1000, 0x4000, 0x10000, 0x40000, 
        0x100000, 0x400000, 0x2, 0x8, 0x20, 0x80, 0x200, 0x800, 0x2000, 0x8000};
    HANDLE *hThreads = (HANDLE *)malloc(num*sizeof(HANDLE));
    double *junk = (double *)malloc(num*sizeof(double));
    printf("Starting %i threads... ", num);
    for(int i=0; i<num; i++) {
        hThreads[i] = (HANDLE)_beginthread(makework, 0, &junk[i]);
        SetThreadAffinityMask(hThreads[i], masks[i]);
    }
    int start = GetTickCount();
    WaitForMultipleObjects(num, hThreads, TRUE, INFINITE);
    int end = GetTickCount();
    FILE *fp = fopen("makework.log", "a+");
    fprintf(fp, "%i,%.3f,%f\n", num, (double)(end-start)/1000.0, junk[0]);
    fclose(fp);
    printf("Elapsed time: %.3f seconds\n", (double)(end-start)/1000.0);
    free(hThreads);
    free(junk);
}

and get

Starting 1 threads... Elapsed time: 12.558 seconds
Starting 2 threads... Elapsed time: 12.558 seconds
Starting 3 threads... Elapsed time: 12.589 seconds
Starting 4 threads... Elapsed time: 12.652 seconds
Starting 5 threads... Elapsed time: 12.621 seconds
Starting 6 threads... Elapsed time: 12.777 seconds
Starting 7 threads... Elapsed time: 12.636 seconds
Starting 8 threads... Elapsed time: 12.886 seconds
Starting 9 threads... Elapsed time: 13.057 seconds
Starting 10 threads... Elapsed time: 12.714 seconds
Starting 11 threads... Elapsed time: 12.777 seconds
Starting 12 threads... Elapsed time: 12.668 seconds
Starting 13 threads... Elapsed time: 26.489 seconds
Starting 14 threads... Elapsed time: 26.505 seconds
Starting 15 threads... Elapsed time: 26.505 seconds
Starting 16 threads... Elapsed time: 26.489 seconds
Starting 17 threads... Elapsed time: 26.489 seconds
Starting 18 threads... Elapsed time: 26.676 seconds
Starting 19 threads... Elapsed time: 26.770 seconds
Starting 20 threads... Elapsed time: 26.489 seconds

which is what I expected. Now the question is: since most of my code is written in MATLAB, which OS settings can I tweak to make this closer to the default behavior...
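
The hard-coded mask table follows a regular pattern (even-numbered logical processors first, then odd-numbered ones), so it could also be generated. A minimal sketch, assuming logical processors 2k and 2k+1 are the two hyperthreads of physical core k; that numbering matched the machine in this answer but is not guaranteed elsewhere, and `build_physical_first_masks` is a hypothetical helper:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill masks[0..count-1] with single-bit affinity masks that select the
 * even-numbered logical processors first (0, 2, 4, ...), then the
 * odd-numbered ones (1, 3, 5, ...).
 * ASSUMPTION: logical processors 2k and 2k+1 share physical core k. */
void build_physical_first_masks(uint64_t *masks, size_t count, unsigned logical_cpus)
{
    assert(logical_cpus > 0 && logical_cpus <= 64 && logical_cpus % 2 == 0);
    assert(count <= logical_cpus);
    unsigned physical = logical_cpus / 2;
    for (size_t i = 0; i < count; i++) {
        unsigned cpu = (i < physical)
            ? 2u * (unsigned)i                     /* first hyperthread of core i */
            : 2u * (unsigned)(i - physical) + 1u;  /* second hyperthread of core i - physical */
        masks[i] = (uint64_t)1 << cpu;
    }
}
```

Called with `count = 20` and `logical_cpus = 24`, this reproduces the 20 values in the table above. On Windows each value would be cast to `DWORD_PTR` before being passed to `SetThreadAffinityMask`; the real core-to-processor mapping can be queried with `GetLogicalProcessorInformation` rather than assumed.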

score 1 · Accepted Answer

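(Possible explanation) Have you checked the background activity on these machines? The OS may not be able to dedicate all of the cores entirely to you. On Machine 1, the significant growth begins once you start occupying more than half of the cores. Your threads may be competing for resources with something else.

You may also want to check for restrictions and domain policies on your computer/account that prevent you from grabbing all the available resources.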

score 0 · Accepted Answer

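Now that we've ruled out the memory saturation theory (although, x87? Ouch, don't expect much performance there; try switching to SSE/AVX if you can live with what they provide), the core scaling should still make sense. Let's look at the CPU models you used.

Can you verify that these are the correct models?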

Intel® Xeon® Processor X5690 (12M Cache, 3.46 GHz, 6.40 GT/s Intel® QPI)

http://ark.intel.com/products/52576

Intel® Xeon® Processor E5-2690 (20M Cache, 2.90 GHz, 8.00 GT/s Intel® QPI)

http://ark.intel.com/products/64596/

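If so, the first CPU actually has 6 physical cores (12 logical) and the second has 8 physical cores (16 logical). Come to think of it, I don't think you could get higher core counts on a single socket in those generations, so it makes sense, and it fits your numbers perfectly.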

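Edit: On a multi-socket system, the OS may prefer a single socket for as long as logical cores are still available there. It probably depends on the exact version, but for Win Server 2008 there is an interesting comment here - http://blogs.technet.com/b/matthts/archive/2012/10/14/windows-server-sockets-logical-processors-symmetric-multi-threading.aspx

Quote: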

When the OS boots it starts with socket 1 and enumerates all logical processors:

    on socket 1 it enumerates logical processors 1-20
    on socket 2 it enumerates logical processors 21-40
    on socket 3 it enumerates logical processors 41-60
    on socket 4 it would see 61-64

If this is the order in which the OS wakes up threads, then SMT may kick in before spilling over to the second socket.
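
To make that hypothesis concrete, here is a toy model. It is purely illustrative: the function name and both placement policies are assumptions, not a description of the real Windows scheduler. It compares filling one socket, including its hyperthreads, before touching the next socket against spreading over every physical core first.

```c
#include <assert.h>

/* Largest number of threads sharing one physical core when `threads`
 * threads run on `sockets` sockets of `cores_per_socket` cores with
 * 2 hyperthreads each. Returns 1 or 2.
 *
 * socket_first != 0: use the cores of socket 0, then their hyperthread
 *                    siblings, before spilling over to the next socket
 *                    (the hypothesis above).
 * socket_first == 0: use every physical core in the machine before any
 *                    hyperthread sibling. */
int max_threads_per_core(int threads, int sockets, int cores_per_socket, int socket_first)
{
    assert(threads > 0 && sockets > 0 && cores_per_socket > 0);
    assert(threads <= 2 * sockets * cores_per_socket);
    int unshared_capacity = socket_first ? cores_per_socket
                                         : sockets * cores_per_socket;
    return threads <= unshared_capacity ? 1 : 2;
}
```

Since the benchmark waits for the slowest thread, elapsed time in this model scales with the return value. For 2 sockets of 6 cores, the socket-first policy predicts the jump at 7 threads, which is where the unpinned Westmere timings start to degrade, while the physical-cores-first policy predicts it at 13 threads.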
