c# - Parallel.パーティショニング用

Question

次のような場合、パーティショニングはどのように行われますか

Parallel.For(0, buffer.Length, (i)=> buffer[i] = 0);

私の仮定では、nコアマシンの場合、作業は分割n wayさn threadsれ、作業ペイロードを実行します。これは、たとえばbuffer.Length = 100 and n = 4, each thread will get 0-24, 25-49, 50-74, 75-99ブロックを意味します。(100 要素の配列はパーティショニングを説明するための例ですが、数百万のアイテムの配列を考えてください。)

これは公正な仮定ですか？議論してください。

Array.Clear(...)この特定のシナリオでは、はるかに高速に実行されることに気付きました。これをどのように合理化しますか？

score 4 · Accepted Answer

First for the easy part. A 100-element array is so small that it can easily fit in a core's cache. Besides, clearing the array is equivalent to setting a memory area to 0s, something that is available as a CPU command and therefore as fast as you can make it.

In fact, SSE commands and parallel-optimized memory controllers mean that the chipset can probalbly clear memory in parallel using only a single CPU command.

On the other hand, Parallel.For introduces some overhead. It has to partition the data, create the appropriate tasks to work on them, collect the results and return the final result. Below Parallel.For, the runtime has to copy the data to each core, handle memory synchronization, collect the results etc. In your example, this can be significantly larger that the actual time needed to zero the memory locations.

In fact, for small sizes it is quite possible that 99.999% of the overhead is memory synchronization as each core tries to access the same memory page. Remember, memory locking is at the page level and you can fit 2K 16-bit ints in a 4K memory page.

As for how PLINQ schedules tasks - there are many different partitioning schemes used, depending on the operators you use. Check Partitioning in LINQ for a nice intro. In any case, the partitioner will try to determine whether there is any benefit to be gained from partitioning and may not partition the data at all.

In your case, the partitioner will probably use Ranged partitioning. Your payload uses only a few CPU cycles so all you see is the overhead of partitioning, creating tasks, managing synchronization and collecting the results.

A better benchmark would be to run some aggregations on a large array, eg. counts and averages and the like.

score 3 · Accepted Answer

PFX/PLINQ の最適化は複雑です。しかし、これが基本的な写真です...

入力側の最適化:

PLINQ には、入力要素をスレッドに割り当てるための 3 つのパーティション分割戦略があります。

Strategy Element allocationRelative performance
Chunk partitioning         Dynamic                Average
Range partitioning         Static                    Poor to excellent
Hash partitioning           Static                    Poor

要素の比較を必要とするクエリ演算子 ( GroupBy、など) の場合Join、GroupJoinPLINQ は常に、比較的非効率的なハッシュパーティション分割を選択します。これは、すべての要素のハッシュコードを事前に計算する必要があるためです (同じコードを持つ要素を同じスレッドで実行できるようにするため)。

他のすべてのクエリ演算子については、範囲またはチャンクパーティショニングのいずれかを選択できます。既定では、入力シーケンスがインデックス可能である場合 (それがあり、配列がから継承されている場合IList<T>)、PLINQ は範囲分割を選択します。それ以外の場合は、チャンクパーティショニングが選択されます。

範囲分割は、すべての要素が同様の量の CPU 時間を消費する長いシーケンスで高速になります。それ以外の場合は、チャンクパーティショニングの方が高速です。

仕組み:

チャンクパーティショニングは、各ワーカースレッドが定期的に入力シーケンスから要素の小さな「チャンク」を取得して処理することによって機能します。PLINQ は、最初に非常に小さなチャンクを割り当て、クエリが進行するにつれてこの量を増やします。これにより、小さなシーケンスが効果的に並列化され、大きなシーケンスが過度の「ラウンドトリップ」を起こさなくなります。ワーカースレッドがたまたまそのジョブをすばやく終了した場合、より多くのチャンクを取得することになります。このシステムは、すべてのスレッドを均等にビジー状態に保ち、マシンのコアを「バランス」させます。この方法の欠点は、共有入力シーケンスから要素を取得するにはロックが必要であり、これによりオーバーヘッドが追加される可能性があることです。

範囲分割は、通常の入力側の列挙をバイパスし、入力シーケンスでの競合を回避するために、各ワーカースレッドに等しい数の要素を事前に割り当てます。スレッドがこのメソッドを使用して早期に終了すると、他のスレッドが終了するまでアイドル状態になります。

パラレルForとForeach:

既定では、 for For/ Foreachloops PLINQ は範囲パーティション分割を使用します。

これが役立つことを願っています。

c# - Parallel.パーティショニング用

2 に答える 2

Related

Reference