delphi - SSE: mass integer conversion + multiplication slower with SSE than with the FPU?

Question

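I am working on an application that very frequently needs to convert 6 to 8 signed 32-bit integers into 32-bit real numbers. I replaced the Delphi code with custom assembler code, and to my great surprise the FPU conversion is always at least as fast as the SSE conversion, and on some computers considerably faster. Here is some code that illustrates this: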

program Project1;

{$R *.res}

uses
 windows,dialogs,sysutils;

type
 piiii=^tiiii;
 tiiii=record i1,i2,i3,i4:longint; end;
 pssss=^tssss;
 tssss=record s1,s2,s3,s4:single; end;

var
 convert_value:single=13579.02468;

function convert_x87(adata:longint):single;
asm
 mov [esp-4],eax
 fild longint([esp-4])
 fmul [convert_value]
end;

procedure convert_sse(afrom,ato,aconv:pointer);
asm
 CVTDQ2PS xmm0,[eax]
 mulps xmm0,[ecx]
 movaps [edx],xmm0
end;

procedure get_mem(var p1,p2:pointer);
begin
 getmem(p1,31);
 p2:=pointer((longint(p1)+15) and (not 15));
end;

var
 a,b,c,d:cardinal;
 z:single;
 i:piiii;
 s1,s2:pssss;
 w1,w2,w3:pointer;
begin
 b:=gettickcount;
 a:=0;
 repeat
  z:=convert_x87(a);

  inc(a);
 until a=0;
 c:=gettickcount-b;

 get_mem(pointer(w1),pointer(i));
 get_mem(pointer(w2),pointer(s1));
 get_mem(pointer(w3),pointer(s2));

 s1.s1:=convert_value;
 s1.s2:=convert_value;
 s1.s3:=convert_value;
 s1.s4:=convert_value;

 b:=gettickcount;
 i.i1:=0;
 i.i2:=1;
 i.i3:=2;
 i.i4:=3;
 repeat
  convert_sse(i,s2,s1);

  inc(i.i1,4);
  inc(i.i2,4);
  inc(i.i3,4);
  inc(i.i4,4);
 until i.i1=0;
 d:=gettickcount-b;

 freemem(w1);
 freemem(w2);
 freemem(w3);

 showmessage('FPU:'+inttostr(c)+'/SSE:'+inttostr(d));
end.

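I need rescaling (i.e. a multiplication) during the conversion, so one is included there. The value used is just one I picked at random, but the results were the same whatever value I used. There is also a very small difference in rounding between the FPU and SSE, but it does not matter in this case.

But if you run that code, you will see that the FPU path is never slower than the SSE path, which makes no sense. Does anyone know what is going on?

Edit: here is a different version of the source code with the loop written in assembler. The results are really interesting. If the increment instructions are commented out, the SSE version is noticeably faster than the FPU version, but if the increment instructions are included, the two run at about the same speed.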

program Project1;

{$R *.res}

uses
 windows,dialogs,sysutils;

type
 piiii=^tiiii;
 tiiii=record i1,i2,i3,i4:longint; end;
 pssss=^tssss;
 tssss=record s1,s2,s3,s4:single; end;

var
 convert_value:single=13579.02468;

procedure test_convert_x87;
asm
 // init test data
 push ebx
 xor ebx,ebx

 mov [esp-4],$98765432

 // convert and multiply 1 int32 to 1 single
@next_loop:
// inc [esp-4]
 fild longint([esp-4])
 fmul [convert_value]
 fstp single([esp-8])

 // loop
 dec ebx
 jnz @next_loop

 pop ebx
end;

procedure test_convert_sse(afrom,ato,aconv:pointer);
asm
 // init test data
 push ebx
 xor ebx,ebx

 mov [eax+0],$98765432
 mov [eax+4],$98765432
 mov [eax+8],$98765432
 mov [eax+12],$98765432

 // convert and multiply 4 int32 to 4 single
@next_loop:
// inc [eax+0]
// inc [eax+4]
// inc [eax+8]
// inc [eax+12]
 cvtdq2ps xmm0,[eax]
 mulps xmm0,[ecx]
 movaps [edx],xmm0

 // loop
 sub ebx,4
 jnz @next_loop

 pop ebx
end;

procedure get_mem(var p1,p2:pointer);
begin
 getmem(p1,31);
 p2:=pointer((longint(p1)+15) and (not 15));
end;

var
 b,c,d:cardinal;
 i:piiii;
 s1,s2:pssss;
 w1,w2,w3:pointer;
begin
 b:=gettickcount;
 test_convert_x87;
 c:=gettickcount-b;

 get_mem(pointer(w1),pointer(i));
 get_mem(pointer(w2),pointer(s1));
 get_mem(pointer(w3),pointer(s2));

 s1.s1:=convert_value;
 s1.s2:=convert_value;
 s1.s3:=convert_value;
 s1.s4:=convert_value;

 b:=gettickcount;
 test_convert_sse(i,s2,s1);
 d:=gettickcount-b;

 freemem(w1);
 freemem(w2);
 freemem(w3);

 showmessage('FPU:'+inttostr(c)+'/SSE:'+inttostr(d));
end.

score 1 · Accepted Answer

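The main thing that looks slow in your asm is that you are not keeping data in registers. Four inc instructions on four consecutive memory locations is insane, so it is no wonder it was slow, especially if you are just going to read the values back from memory again on the next iteration. Set up a loop-counter vector outside the loop, and increment it by adding a vector of { 1, 1, 1, 1 } to it.

Your question also doesn't include any reminder of what the calling convention is for 32-bit Windows (which argument goes in which register).

So your inner loop could look something like this: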

; *untested*
    movdqa xmm1, [ vector_of_ones ]   ; or pcmpeqd same,same -> all-ones bits, then psrld by 31 -> 1 in each dword
    xor ebx, ebx  ; loop counter
;  also broadcast the scale value to xmm4, maybe with shufps
    movdqa   xmm2, [eax]   ; values to be incremented and converted
next_loop:                 ; not named "loop", which is an instruction mnemonic
    cvtdq2ps xmm0, xmm2
    mulps    xmm0, xmm4  ; scale
    movaps   [edx], xmm0
    paddd    xmm2, xmm1  ; increment counters
    sub      ebx, 4
    jne      next_loop  ; 2^30 iterations, 4 values each (2^32 conversions)

    ; movdqa    [eax], xmm2   ; store the incremented loop counter?
    ;  Not sure if this was desired, or a side effect of using mem instead of regs.
    ; If you want this to work on an array, put this store in the loop
    ; and use an indexed addressing mode for eax and edx (or increment pointers)
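
For Delphi's inline assembler, the loop above might be adapted roughly as follows. This is an untested sketch: test_convert_sse_regs is a hypothetical name, and it assumes the default 32-bit register calling convention (first three arguments in EAX, EDX, ECX), which the question's own code already relies on, plus 16-byte aligned buffers like the ones get_mem produces.

```delphi
// Assumed: EAX=afrom, EDX=ato, ECX=aconv (Delphi 32-bit register convention).
// Assumed: all three pointers are 16-byte aligned, as get_mem arranges.
procedure test_convert_sse_regs(afrom,ato,aconv:pointer);
asm
 push ebx
 xor ebx,ebx

 pcmpeqd xmm1,xmm1     // all bits set in every lane
 psrld xmm1,31         // shift right by 31: each dword lane becomes 1
 movaps xmm4,[ecx]     // scale vector, loaded once
 movdqa xmm2,[eax]     // the four counters, loaded once

@next_loop:
 cvtdq2ps xmm0,xmm2    // convert 4 int32 to 4 single
 mulps xmm0,xmm4       // scale
 movaps [edx],xmm0     // the only memory access left in the loop
 paddd xmm2,xmm1       // increment the counters in the register

 sub ebx,4
 jnz @next_loop        // 2^30 iterations, 4 values each

 movdqa [eax],xmm2     // write the final counters back once
 pop ebx
end;
```

It can be called the same way as test_convert_sse in the question, provided the caller initialises the four counters at afrom first, since this version no longer writes them itself.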

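If this is for a function that doesn't loop, then setting up the scale vector for mulps is different. Ideally the scale argument would be passed in the low element of a vector register, and you would broadcast it from there with shufps or something similar. If Delphi forces it to be somewhere like memory pointed to by a GP register, then I guess you movss it first. If it is a compile-time constant, using a 16B vector constant as a memory operand to mulps is probably the best way to go. On Core2 and later, a 128b load only costs a single cycle of load throughput. (It does have to be 16-byte aligned, though, when used as a memory operand of a non-AVX vector instruction.)

Anyway, I think the slow part of your benchmark was the memory access, especially the writes. Only one store per cycle is possible. If Delphi can't pass float arguments in registers, that sucks.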
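
A minimal, untested sketch of the movss plus shufps idea for a non-looping routine, assuming Delphi passes the three pointer arguments in EAX, EDX and ECX as in the question's convert_sse. The names convert_sse_scalar and ascale are hypothetical; afrom and ato are assumed to be 16-byte aligned, while ascale may point at any single.

```delphi
// Assumed: EAX=afrom, EDX=ato, ECX=ascale (Delphi 32-bit register convention).
// afrom and ato must be 16-byte aligned; ascale has no alignment requirement.
procedure convert_sse_scalar(afrom,ato,ascale:pointer);
asm
 movss xmm1,dword ptr [ecx]   // load the one scale factor into the low lane
 shufps xmm1,xmm1,0           // broadcast the low lane to all four lanes
 cvtdq2ps xmm0,[eax]          // convert 4 int32 to 4 single
 mulps xmm0,xmm1              // scale
 movaps [edx],xmm0            // store the 4 results
end;
```

Because the scale is read with movss, the call can point straight at a plain single, for example convert_sse_scalar(i,s2,@convert_value), so the separate 16-byte s1 buffer from the question would not be needed.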
