assembly - AVX VMOVDQA slower than two SSE MOVDQA?

Question

While I was working on my fast ADD loop (Speed up x64 assembler ADD loop), I was testing memory access with SSE and AVX instructions. To add, I have to read two inputs and produce one output. So I wrote a dummy routine that reads two x64 values into registers and writes one back to memory without doing any operation. This is of course useless; I only did it for benchmarking.

I use an unrolled loop that handles 64 bytes per iteration. It consists of 8 blocks like this:

mov rax, QWORD PTR [rdx+r11*8-64]
mov r10, QWORD PTR [r8+r11*8-64]
mov QWORD PTR [rcx+r11*8-64], rax

Then I upgraded it to SSE2. Now I use 4 blocks like this:

movdqa xmm0, XMMWORD PTR [rdx+r11*8-64]
movdqa xmm1, XMMWORD PTR [r8+r11*8-64]
movdqa XMMWORD PTR [rcx+r11*8-64], xmm0

And later on I used AVX (256 bits per register). I have 2 blocks like this:

vmovdqa ymm0, YMMWORD PTR [rdx+r11*8-64]
vmovdqa ymm1, YMMWORD PTR [r8+r11*8-64]
vmovdqa YMMWORD PTR [rcx+r11*8-64], ymm0

So far, so not-so-extremely-spectacular. What is interesting is the benchmarking result: when I run the three different approaches on 1k+1k=1k 64-bit words (i.e. two times 8 KB of input and one time 8 KB of output), I get strange results. Each of the following timings is for processing two times 64 bytes of input into 64 bytes of output.

The x64 register method runs at about 15 cycles/64 bytes
The SSE2 method runs at about 8.5 cycles/64 bytes
The AVX method runs at about 9 cycles/64 bytes
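
To make the three timings easier to compare, they can be converted into bytes moved per cycle. This is only a small sketch of that arithmetic; the names `BYTES_PER_ITERATION` and `bytes_per_cycle` are made up for illustration, and the only inputs are the cycle counts listed above.

```c
#include <assert.h>

/* Bytes moved per loop iteration in the benchmark described above:
   two 64-byte inputs are read and one 64-byte output is written. */
#define BYTES_PER_ITERATION (2 * 64 + 64)

/* Converts a measured cycles-per-iteration figure into bytes per cycle. */
double bytes_per_cycle(double cycles_per_iteration)
{
    assert(cycles_per_iteration > 0.0);
    return BYTES_PER_ITERATION / cycles_per_iteration;
}
```

Calling `bytes_per_cycle` with 15, 8.5 and 9 gives roughly 12.8, 22.6 and 21.3 bytes per cycle, so the SSE2 and AVX versions are within about 6% of each other.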

My question is: how come the AVX method is slower (though not by a lot) than the SSE2 method? I expected it to be at least on par. Does using the YMM registers cost so much extra time? The memory was aligned (you get GPFs otherwise).

Does anyone have an explanation for this?

score 14 · Accepted Answer

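On Sandybridge/Ivybridge, 256b AVX loads and stores are cracked into two 128b ops [as Peter Cordes notes, these aren't quite µops, but the operation requires two cycles to clear the port], so there's no reason to expect the version using those instructions to be much faster.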
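
A rough way to see why no speedup should be expected is to count how long the loads keep the load ports busy. The sketch below is a deliberately simplified model, not a description of the real pipeline: it assumes two load ports that are perfectly balanced, takes the 1-cycle versus 2-cycle port occupancy from the paragraph above, and ignores stores and address generation entirely. The function name `load_port_cycles` is hypothetical.

```c
#include <assert.h>

/* Cycles a batch of loads keeps the load ports busy.
   Assumption (from the answer): a 128-bit load occupies a port for
   1 cycle and a 256-bit load for 2 cycles on Sandybridge/Ivybridge.
   Assumption (simplified model): 2 load ports, perfectly balanced;
   stores and address generation are ignored. */
unsigned load_port_cycles(unsigned bytes, unsigned load_width_bytes)
{
    assert(load_width_bytes == 16 || load_width_bytes == 32);
    assert(bytes % load_width_bytes == 0);

    const unsigned load_ports = 2;                      /* assumed */
    unsigned loads = bytes / load_width_bytes;
    unsigned cycles_per_load = load_width_bytes / 16;   /* 1 for xmm, 2 for ymm */
    unsigned port_cycles = loads * cycles_per_load;

    return (port_cycles + load_ports - 1) / load_ports; /* round up */
}
```

For the 128 bytes of input read per iteration, `load_port_cycles(128, 16)` and `load_port_cycles(128, 32)` both come out at 4 cycles under this model: half as many instructions, each occupying a port twice as long.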

Why is it slower? Two possibilities come to mind:

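For base + index + offset addressing, the latency of a 128b load is 6 cycles, whereas the latency of a 256b load is 7 cycles (Table 2-8 in the Intel Optimization Manual). Although your benchmark should be bound by throughput and not latency, the longer latency means that the processor takes longer to recover from any hiccups (pipeline stalls, mispredictions, interrupts being serviced, and so on).
In 11.6.2 of the same document, Intel suggests that the penalty for cache line and page crossing may be larger for 256b loads than it is for 128b loads. If your loads are not all 32-byte aligned, this may also explain the slowdown you are seeing when using the 256b load/store operations.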

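Example 11-12 shows two implementations for SAXPY with unaligned addresses. Alternative 1 uses 32-byte loads and alternative 2 uses 16-byte loads. These code samples are executed with two source buffers, src1 and src2, at a 4-byte offset from 32-byte alignment, and a destination buffer, DST, that is 32-byte aligned. Using two 16-byte memory operations in lieu of a 32-byte memory access performs faster.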
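
To get a feel for the misaligned case Intel describes, here is a small sketch that counts how many accesses straddle a cache line when a buffer starts 4 bytes past a 32-byte boundary. It assumes 64-byte cache lines and back-to-back accesses of a fixed width; the helper names `crosses_line` and `count_crossings` are made up for illustration. It only counts the splits and says nothing about how expensive each one is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* assumed line size, typical for x86 */

/* Returns 1 if an access of `size` bytes at `addr` touches two cache lines. */
int crosses_line(uintptr_t addr, size_t size)
{
    assert(size > 0 && size <= CACHE_LINE);
    return (addr / CACHE_LINE) != ((addr + size - 1) / CACHE_LINE);
}

/* Counts line-crossing accesses when reading `total` bytes starting at
   `start` with back-to-back accesses of `width` bytes each. */
unsigned count_crossings(uintptr_t start, size_t total, size_t width)
{
    unsigned n = 0;
    for (size_t off = 0; off + width <= total; off += width)
        n += (unsigned)crosses_line(start + off, width);
    return n;
}
```

With `start = 4` and `total = 256`, both `count_crossings(4, 256, 32)` and `count_crossings(4, 256, 16)` return 4. That is every second 32-byte access but only every fourth 16-byte access, so under this model any gain from the 16-byte version would have to come from each split being cheaper rather than from there being fewer of them.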
