If the number of threads per block is already greater than the number of CUDA cores, is there any performance advantage to launching a grid of blocks at once, rather than launching the blocks one at a time?
2 Answers
I believe there is. Thread blocks are assigned to streaming multiprocessors (SMs), and each SM further splits a block's threads into warps of 32 threads (newer architectures may handle larger warps). With this in mind, it is faster to split each computation into enough blocks to occupy as many SMs as possible. It also means you should build blocks whose thread count is a multiple of the warp size the card supports (blocks of 32 or 64 threads rather than 40, if the SM uses 32-thread warps).
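To illustrate that last point, here is a minimal launch-configuration sketch. The kernel, its name, and the array size are made up for the example; the point is simply a block size that is a multiple of the 32-thread warp and a grid large enough to cover all SMs:

```cuda
#include <cuda_runtime.h>

// Trivial kernel used only to illustrate the launch configuration.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard: the grid may be rounded up past n
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;     // hypothetical problem size
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // Block size as a multiple of the 32-thread warp (e.g. 64, not 40),
    // and enough blocks to spread the work across all SMs at once.
    const int threadsPerBlock = 64;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```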
Launch Latency
Launch latency (from the API call to work starting on the GPU) for a grid is 3-8 µs on Linux and 30-80 µs on Windows Vista/Win7.
Distributing a block to an SM takes tens to hundreds of nanoseconds.
Launching a warp in a block (32 threads) takes a few cycles and happens in parallel on each SM. A rough way to observe the first of these from the host is sketched below.
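A simple host-side sketch for getting a feel for launch overhead: launch an empty kernel many times and average. Note this measures amortized launch-plus-execution time rather than pure API-call latency, and the numbers will vary by OS and driver, as noted above:

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty() {}   // no work: what remains is launch overhead

int main()
{
    empty<<<1, 32>>>();                  // warm up: the first launch pays init costs
    cudaDeviceSynchronize();

    const int iters = 1000;
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iters; ++i)
        empty<<<1, 32>>>();
    cudaDeviceSynchronize();             // wait for all launches to drain
    auto t1 = std::chrono::high_resolution_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
    printf("average launch + execute time: %.2f us\n", us);
    return 0;
}
```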
Resource Limitations
Concurrent kernels:
- Tesla: N/A, only 1 grid at a time
- Fermi: 16 grids at a time
- Kepler: 16 grids (Kepler2: 32 grids)

Maximum blocks (not considering occupancy limitations):
- Tesla: SmCount * 8 (gtx280 = 30 * 8 = 240)
- Fermi: SmCount * 16 (gf100 = 16 * 16 = 256)
- Kepler: SmCount * 16 (gk104 = 8 * 16 = 128)
See the occupancy calculator for the limits on threads per block, threads per SM, registers per SM, registers per thread, and so on.
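Many of these limits can also be queried at runtime. A minimal sketch using cudaGetDeviceProperties (the field names are from the CUDA runtime API):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("SM count:              %d\n",  prop.multiProcessorCount);
    printf("warp size:             %d\n",  prop.warpSize);
    printf("max threads per block: %d\n",  prop.maxThreadsPerBlock);
    printf("max threads per SM:    %d\n",  prop.maxThreadsPerMultiProcessor);
    printf("registers per block:   %d\n",  prop.regsPerBlock);
    printf("shared mem per block:  %zu\n", prop.sharedMemPerBlock);
    return 0;
}
```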
Warp Scheduling and CUDA Cores
CUDA cores are floating-point/ALU units. Each SM has other types of execution units, including load/store, special function, branch, etc. A CUDA core is equivalent to a SIMD unit in an x86 processor. It is not equivalent to an x86 core.
Occupancy is the ratio of active warps per SM to the maximum number of warps per SM. The more warps per SM, the higher the chance that the warp scheduler has an eligible warp to schedule. However, the higher the occupancy, the fewer resources are available per thread. As a basic goal you want to target more than:
- 25% or 8 warps on Tesla
- 50% or 24 warps on Fermi
- 50% or 32 warps on Kepler (generally higher)
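A back-of-the-envelope version of that calculation, as a sketch: the kernel and block size are placeholders, and the occupancy query API shown here was added to the CUDA runtime in releases later than the hardware discussed above:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel(float *out) { /* placeholder body */ }

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int blockSize = 128;           // hypothetical block size
    int blocksPerSM = 0;
    // Ask the runtime how many blocks of this kernel fit on one SM.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernel, blockSize, 0);

    // Occupancy = active warps per SM / maximum warps per SM.
    int warpsPerSM    = blocksPerSM * blockSize / prop.warpSize;
    int maxWarpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("occupancy: %d/%d warps = %.0f%%\n",
           warpsPerSM, maxWarpsPerSM, 100.0 * warpsPerSM / maxWarpsPerSM);
    return 0;
}
```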
You'll notice there is no real relationship to CUDA cores in these calculations.
To understand this better, read the Fermi whitepaper, and if you can use the Nsight Visual Studio Edition CUDA profiler, look at the Issue Efficiency experiment (not yet available in the CUDA Profiler or Visual Profiler) to understand how well your kernel is hiding execution and memory latency.