glsl - GLSL - Does a dot product really cost only 1 cycle?

Question

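I've come across several situations where the claim is made that doing a dot product in GLSL ends up being run in one cycle. For example:

The vertex and fragment processors operate on four-vectors, performing four-component instructions such as additions, multiplications, multiply-accumulates, or dot products in a single cycle.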

http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter35.html

I've also seen a claim in comments somewhere that:

    dot(value, vec4(.25))

would be a more efficient way to average four values, compared to:

    (x + y + z + w) / 4.0

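Again, the claim was that dot(vec4, vec4) would run in one cycle.

The ARB seems to say that dot product (DP3 and DP4) and cross product (XPD) are single instructions, but does that mean that those are just as computationally expensive as doing a vec4 add? Is there basically some hardware implementation, along the lines of multiply-accumulate on steroids, in play here? I can see how something like that is useful in computer graphics, but doing in one cycle what could be quite a few instructions on their own sounds like a lot.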

score 12 · Accepted Answer

The question cannot be answered in any definitive way as a whole. How long any operation takes in hardware is not just hardware-specific, but also code-specific. That is, the surrounding code can completely mask the cost of an operation, or it can make it take longer.

In general, you should not assume that a dot product is single-cycle.

However, there are certain aspects that can certainly be answered:

I've also seen a claim in comments somewhere that dot(value, vec4(.25)) would be a more efficient way to average four values, compared to (x + y + z + w) / 4.0.

I would expect this to be kinda true, so long as x, y, z, and w are in fact different float values rather than members of the same vec4 (that is, they're not value.x, value.y, etc). If they are elements of the same vector, I would say that any decent optimizing compiler should compile both of these to the same set of instructions. A good peephole optimizer should catch patterns like this.

I say that it is "kinda true", because it depends on the hardware. The dot-product version should at the very least not be slower. And again, if they are elements of the same vector, the optimizer should handle it.
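To make the comparison concrete, here is a minimal sketch of the two averaging forms, written in plain Python rather than GLSL so it can be run anywhere. The helper names (dot4, average_via_dot, average_via_sum) are made up for this illustration, and the code only shows that the two expressions compute the same value. It says nothing about how many cycles either form costs on a given GPU.

```python
# Plain-Python stand-ins for GLSL's dot() and a vec4.
# This checks arithmetic equivalence only; it is not a GPU cost model.

def dot4(a, b):
    """Four multiplies and three adds, like dot(vec4, vec4)."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3]

def average_via_dot(value):
    # Mirrors dot(value, vec4(.25))
    return dot4(value, (0.25, 0.25, 0.25, 0.25))

def average_via_sum(value):
    # Mirrors (x + y + z + w) / 4.0
    x, y, z, w = value
    return (x + y + z + w) / 4.0

# Inputs chosen so every intermediate value is exactly representable.
value = (1.0, 2.0, 3.0, 6.0)
avg_dot = average_via_dot(value)
avg_sum = average_via_sum(value)
print(avg_dot, avg_sum)
```

Because both forms produce the same result for the same vec4, a compiler is free to pick whichever instruction sequence is cheapest on the target hardware, which is the point made above.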

The ARB seems to say that dot product (DP3 and DP4) and cross product (XPD) are single instructions, but does that mean that those are just as computationally expensive as doing a vec4 add?

You should not assume that ARB assembly has any relation to the actual hardware machine instruction code.

Is there basically some hardware implementation, along the lines of multiply-accumulate on steroids, in play here?

If you want to talk about hardware, it's very hardware-specific. Once upon a time, there was specialized dot-product hardware. This was in the days of so-called "DOT3 bumpmapping" and the early DX8-era of shaders.

However, in order to speed up general operations, they had to take that sort of thing out. So now, for most modern hardware (anything Radeon HD-class or NVIDIA 8xxx or better, the so-called DX10 or DX11 hardware), dot products do pretty much what they say they do: each multiply/add takes up a cycle.

However, this hardware also allows for a lot of parallelism, so you could have 4 separate vec4 dot products happening simultaneously. Each one would take 4 cycles. But, as long as the results of these operations are not used in the others, they can all execute in parallel. And therefore, the four of them total would take 4 cycles.
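The arithmetic in the last two paragraphs can be sketched with a toy model. It assumes each vec4 dot product is a dependent chain of four one-cycle scalar operations (one multiply followed by three multiply-adds) and that the hardware has some number of independent scalar lanes. The function name and the lanes parameter are hypothetical, and real GPUs schedule work in far more complicated ways.

```python
import math

# Assumption: one multiply plus three multiply-adds, one cycle each.
OPS_PER_DOT4 = 4

def cycles_for_dot_products(count, lanes):
    """Cycles to finish `count` independent vec4 dot products
    when `lanes` of them can be in flight side by side."""
    batches = math.ceil(count / lanes)
    return batches * OPS_PER_DOT4

one_alone = cycles_for_dot_products(1, lanes=4)      # a single dot product
four_parallel = cycles_for_dot_products(4, lanes=4)  # four independent ones
four_serial = cycles_for_dot_products(4, lanes=1)    # no parallelism available
print(one_alone, four_parallel, four_serial)
```

In this model a lone dot product and four independent ones both finish in 4 cycles, while four that cannot overlap take 16. That is why the cost of a dot product depends so heavily on the surrounding code.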

So again, it's very complicated. And hardware-dependent.

Your best bet is to start with something that is reasonable. Then learn about the hardware you're trying to code towards, and work from there.

score 5 · Accepted Answer

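Nicol Bolas handled the practical answer, from the perspective of "ARB assembly" or looking at IR dumps. I'll address the question "How can 4 multiplies and 3 adds be one cycle in hardware?! That sounds impossible."

With heavy pipelining, any instruction can be made to have a one-cycle throughput, no matter how complex.

Do not confuse this with one cycle of latency!

With fully pipelined execution, an instruction can be spread out over several stages of the pipeline. All stages of the pipeline operate simultaneously.

Each cycle, the first stage accepts a new instruction, and its outputs go into the next stage. Each cycle, a result comes out the end of the pipeline.

Let's examine a 4D dot product on a hypothetical core with a multiply latency of 3 cycles and an add latency of 5 cycles.

If this pipeline were laid out the worst way, with no vector parallelism, it would be 4 multiplies and 3 adds in sequence, giving 12 + 15 cycles, for a total latency of 27 cycles.

Does this mean that a dot product takes 27 cycles? Absolutely not, because the pipeline can start a new one every cycle, and it delivers each answer 27 cycles later.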

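If you need to do one dot product and have to wait for the answer, then you have to wait the full 27 cycles of latency for the result. If, however, you have 1000 dot products to compute, it takes about 1027 cycles. There are no results for the first 26 cycles, and on the 27th cycle the first result comes out the end. After the 1000th input has been issued, it takes another 26 cycles for the last result to come out. Amortized over the whole batch, this makes the dot product take "one cycle".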
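The latency and throughput numbers above can be reproduced with a small sketch. It assumes the hypothetical core from this answer (3-cycle multiply, 5-cycle add, fully serialized, one new dot product issued per cycle); the function names are invented for illustration. Under the counting convention used here, where the first result appears in cycle 27, the exact total for 1000 dot products is latency + 999 = 1026 cycles, one short of the rounded figure in the text.

```python
# Hypothetical core from the answer: these latencies are assumptions.
MUL_LATENCY = 3
ADD_LATENCY = 5

def dot4_latency(mul_latency, add_latency):
    """Worst-case serial layout: 4 multiplies then 3 adds, no overlap."""
    return 4 * mul_latency + 3 * add_latency

def total_cycles(count, latency):
    """Fully pipelined unit, one new dot product issued per cycle.
    The first result appears in cycle `latency`, then one per cycle."""
    return latency + (count - 1)

latency = dot4_latency(MUL_LATENCY, ADD_LATENCY)
one = total_cycles(1, latency)
thousand = total_cycles(1000, latency)
cycles_per_dot = thousand / 1000
print(latency, one, thousand, cycles_per_dot)
```

A single dot product pays the full 27-cycle latency, while a batch of 1000 averages out to just over one cycle each. That amortized figure is what "one cycle" means here.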

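Real processors distribute the work across the stages in various ways, giving more or fewer pipeline stages, so they may have completely different numbers from the ones described above, but the idea remains the same. Generally, the less work you do per stage, the shorter the clock cycle can become.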

score 0 · Accepted Answer

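The key is that a vec4 can be "operated" on in a single instruction (see the work Intel did on 16-byte register operations, which is the basis for much of the iOS Accelerate framework).

If you start splitting and swizzling the vector apart, there is no longer a "single memory address" of the vector to perform the operation on.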
