cuda - How to understand "All threads in a warp execute the same instruction at the same time" on a GPU?

Question

I am reading Professional CUDA C Programming, and the GPU Architecture Overview section says:

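CUDA employs a Single Instruction Multiple Thread (SIMT) architecture to manage and execute threads in groups of 32 called warps. All threads in a warp execute the same instruction at the same time. Each thread has its own instruction address counter and register state, and carries out the current instruction on its own data. Each SM partitions the thread blocks assigned to it into 32-thread warps that it then schedules for execution on available hardware resources.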

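The SIMT architecture is similar to the SIMD (Single Instruction, Multiple Data) architecture. Both SIMD and SIMT implement parallelism by broadcasting the same instruction to multiple execution units. A key difference is that SIMD requires that all vector elements in a vector execute together in a unified synchronous group, whereas SIMT allows multiple threads in the same warp to execute independently. Even though all threads in a warp start together at the same program address, it is possible for individual threads to have different behavior. SIMT enables you to write thread-level parallel code for independent, scalar threads, as well as data-parallel code for coordinated threads. The SIMT model includes three key features that SIMD does not:
➤ Each thread has its own instruction address counter.
➤ Each thread has its own register state.
➤ Each thread can have an independent execution path.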

The first paragraph says "All threads in a warp execute the same instruction at the same time.", but the second paragraph says "Even though all threads in a warp start together at the same program address, it is possible for individual threads to have different behavior." This confuses me, because the two statements seem to contradict each other. Can anyone explain?

score 6 · Accepted Answer

There is no contradiction. All threads in a warp execute the same instruction in lock-step at all times. To support conditional execution and branching, CUDA introduces two concepts in the SIMT model:

Predicated execution (See here)
Instruction replay/serialisation (See here)

Predicated execution means that the result of a conditional instruction can be used to mask off threads from executing a subsequent instruction, without a branch. Instruction replay is how a classic conditional branch is dealt with. When the threads of a warp disagree on a branch, the warp executes every path of the conditionally executed code in turn by replaying instructions. Threads which do not follow a particular execution path are masked off and execute the equivalent of a NOP. This serialisation is the so-called branch divergence penalty in CUDA, and it can have a significant impact on performance.

This is how lock-step execution can support branching.
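To make the divergence case concrete, here is a minimal sketch of two kernels that write the same kind of values but differ in how the branch condition lines up with warp boundaries. The names divergent_kernel, warp_uniform_kernel and run_demo are made up for this illustration. Note that for a branch body this short the compiler may well emit predicated instructions rather than a real branch, so treat this as a picture of the idea and not as a benchmark.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Even and odd lanes of the same warp disagree on the condition, so the
// warp has to step through both paths, with the non-participating lanes
// masked off on each pass (or predicated, if the compiler chooses that).
__global__ void divergent_kernel(int *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    if (tid % 2 == 0) {
        out[tid] = 100;
    } else {
        out[tid] = 200;
    }
}

// The condition is the same for every thread of a given warp, so each
// warp follows exactly one path and nothing needs to be serialised.
__global__ void warp_uniform_kernel(int *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    if ((tid / warpSize) % 2 == 0) {
        out[tid] = 100;
    } else {
        out[tid] = 200;
    }
}

// Hypothetical host helper: launches one of the two kernels over n
// elements and copies the result back. Returns an empty vector on error.
std::vector<int> run_demo(bool warp_uniform, int n)
{
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return {};

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    if (warp_uniform) {
        warp_uniform_kernel<<<blocks, threads>>>(d_out, n);
    } else {
        divergent_kernel<<<blocks, threads>>>(d_out, n);
    }

    std::vector<int> host(n);
    // A blocking cudaMemcpy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(host.data(), d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return {};
    return host;
}
```

Calling run_demo(false, 64) and run_demo(true, 64) from a host main gives per-thread results in both cases. In the first kernel every warp contains threads on both sides of the branch, while in the second the assignment alternates per warp, assuming the usual warp size of 32. Every thread still gets its own answer either way; only the number of passes the warp needs differs.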
