c - プロセスの数を増やしても、この並列コードの実行時間が短縮されないのはなぜですか？

Question

編集：申し訳ありませんが、2つの5000x5000行列を乗算することを忘れました。

これは、プロセス数を増やすと時間も増えることを示す出力です。それで、このコードのロジックに問題がありますか？私はそれをウェブから見つけ、名前をmatrixMultiとそのprintfに変更しただけです。論理的ですが、グリッドラボに接続してプロセスの数を増やすと、正しく機能しないように見えます。それで、あなたはどう思いますか？

ターミナルのスクリーンショット

/**********************************************************************************************
 * Matrix Multiplication Program using MPI.
 *
 * Viraj Brian Wijesuriya - University of Colombo School of Computing, Sri Lanka.
 *
 * Works with any type of two matrixes [A], [B] which could be multiplied to produce a matrix [c].
 *
 * Master process initializes the multiplication operands, distributes the muliplication
 * operation to worker processes and reduces the worker results to construct the final output.
 *
 ************************************************************************************************/

#include<stdio.h>
#include<mpi.h>
#define NUM_ROWS_A 5000 //rows of input [A]
#define NUM_COLUMNS_A 5000 //columns of input [A]
#define NUM_ROWS_B 5000 //rows of input [B]
#define NUM_COLUMNS_B 5000 //columns of input [B]
#define MASTER_TO_SLAVE_TAG 1 //tag for messages sent from master to slaves
#define SLAVE_TO_MASTER_TAG 4 //tag for messages sent from slaves to master
void makeAB(); //makes the [A] and [B] matrixes
void printArray(); //print the content of output matrix [C];

int rank; //process rank
int size; //number of processes
int i, j, k; //helper variables
double mat_a[NUM_ROWS_A][NUM_COLUMNS_A]; //declare input [A]
double mat_b[NUM_ROWS_B][NUM_COLUMNS_B]; //declare input [B]
double mat_result[NUM_ROWS_A][NUM_COLUMNS_B]; //declare output [C]
double start_time; //hold start time
double end_time; // hold end time
int low_bound; //low bound of the number of rows of [A] allocated to a slave
int upper_bound; //upper bound of the number of rows of [A] allocated to a slave
int portion; //portion of the number of rows of [A] allocated to a slave
MPI_Status status; // store status of a MPI_Recv
MPI_Request request; //capture request of a MPI_Isend

int main(int argc, char *argv[])
{

    MPI_Init(&argc, &argv); //initialize MPI operations
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); //get the rank
    MPI_Comm_size(MPI_COMM_WORLD, &size); //get number of processes

    /* master initializes work*/
    if (rank == 0) {
        makeAB();
        start_time = MPI_Wtime();
        for (i = 1; i < size; i++) {//for each slave other than the master
            portion = (NUM_ROWS_A / (size - 1)); // calculate portion without master
            low_bound = (i - 1) * portion;
            if (((i + 1) == size) && ((NUM_ROWS_A % (size - 1)) != 0)) {//if rows of [A] cannot be equally divided among slaves
                upper_bound = NUM_ROWS_A; //last slave gets all the remaining rows
            } else {
                upper_bound = low_bound + portion; //rows of [A] are equally divisable among slaves
            }
            //send the low bound first without blocking, to the intended slave
            MPI_Isend(&low_bound, 1, MPI_INT, i, MASTER_TO_SLAVE_TAG, MPI_COMM_WORLD, &request);
            //next send the upper bound without blocking, to the intended slave
            MPI_Isend(&upper_bound, 1, MPI_INT, i, MASTER_TO_SLAVE_TAG + 1, MPI_COMM_WORLD, &request);
            //finally send the allocated row portion of [A] without blocking, to the intended slave
            MPI_Isend(&mat_a[low_bound][0], (upper_bound - low_bound) * NUM_COLUMNS_A, MPI_DOUBLE, i, MASTER_TO_SLAVE_TAG + 2, MPI_COMM_WORLD, &request);
        }
    }
    //broadcast [B] to all the slaves
    MPI_Bcast(&mat_b, NUM_ROWS_B*NUM_COLUMNS_B, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* work done by slaves*/
    if (rank > 0) {
        //receive low bound from the master
        MPI_Recv(&low_bound, 1, MPI_INT, 0, MASTER_TO_SLAVE_TAG, MPI_COMM_WORLD, &status);
        //next receive upper bound from the master
        MPI_Recv(&upper_bound, 1, MPI_INT, 0, MASTER_TO_SLAVE_TAG + 1, MPI_COMM_WORLD, &status);
        //finally receive row portion of [A] to be processed from the master
        MPI_Recv(&mat_a[low_bound][0], (upper_bound - low_bound) * NUM_COLUMNS_A, MPI_DOUBLE, 0, MASTER_TO_SLAVE_TAG + 2, MPI_COMM_WORLD, &status);
        for (i = low_bound; i < upper_bound; i++) {//iterate through a given set of rows of [A]
            for (j = 0; j < NUM_COLUMNS_B; j++) {//iterate through columns of [B]
                for (k = 0; k < NUM_ROWS_B; k++) {//iterate through rows of [B]
                    mat_result[i][j] += (mat_a[i][k] * mat_b[k][j]);
                }
            }
        }
        //send back the low bound first without blocking, to the master
        MPI_Isend(&low_bound, 1, MPI_INT, 0, SLAVE_TO_MASTER_TAG, MPI_COMM_WORLD, &request);
        //send the upper bound next without blocking, to the master
        MPI_Isend(&upper_bound, 1, MPI_INT, 0, SLAVE_TO_MASTER_TAG + 1, MPI_COMM_WORLD, &request);
        //finally send the processed portion of data without blocking, to the master
        MPI_Isend(&mat_result[low_bound][0], (upper_bound - low_bound) * NUM_COLUMNS_B, MPI_DOUBLE, 0, SLAVE_TO_MASTER_TAG + 2, MPI_COMM_WORLD, &request);
    }

    /* master gathers processed work*/
    if (rank == 0) {
        for (i = 1; i < size; i++) {// untill all slaves have handed back the processed data
            //receive low bound from a slave
            MPI_Recv(&low_bound, 1, MPI_INT, i, SLAVE_TO_MASTER_TAG, MPI_COMM_WORLD, &status);
            //receive upper bound from a slave
            MPI_Recv(&upper_bound, 1, MPI_INT, i, SLAVE_TO_MASTER_TAG + 1, MPI_COMM_WORLD, &status);
            //receive processed data from a slave
            MPI_Recv(&mat_result[low_bound][0], (upper_bound - low_bound) * NUM_COLUMNS_B, MPI_DOUBLE, i, SLAVE_TO_MASTER_TAG + 2, MPI_COMM_WORLD, &status);
        }
        end_time = MPI_Wtime();
        printf("\nRunning Time = %f\n\n", end_time - start_time);
        printArray();
    }
    MPI_Finalize(); //finalize MPI operations
    return 0;
}

void makeAB()
{
    for (i = 0; i < NUM_ROWS_A; i++) {
        for (j = 0; j < NUM_COLUMNS_A; j++) {
            mat_a[i][j] = i + j;
        }
    }
    for (i = 0; i < NUM_ROWS_B; i++) {
        for (j = 0; j < NUM_COLUMNS_B; j++) {
            mat_b[i][j] = i*j;
        }
    }
}

void printArray()
{
    for (i = 0; i < NUM_ROWS_A; i++) {
        printf("\n");
        for (j = 0; j < NUM_COLUMNS_A; j++)
            printf("%8.2f ", mat_a[i][j]);
    }
    printf("\n\n\n");
    for (i = 0; i < NUM_ROWS_B; i++) {
        printf("\n");
        for (j = 0; j < NUM_COLUMNS_B; j++)
            printf("%8.2f ", mat_b[i][j]);
    }
    printf("\n\n\n");
    for (i = 0; i < NUM_ROWS_A; i++) {
        printf("\n");
        for (j = 0; j < NUM_COLUMNS_B; j++)
            printf("%8.2f ", mat_result[i][j]);
    }
    printf("\n\n");
}

score 5 · Accepted Answer

これは実際にはそれほど驚くべきことではありません。ワーカーが多いほど、通信のオーバーヘッドが大きくなります（作業を分割し、結果を集約して戻す）。したがって、並列化を活用できる十分なワーカーが存在するスイートスポットがよくありますが、ワーカーはそれほど多くありません。通信のオーバーヘッドが問題になり始めていること。コアの数を増やすと、作業を小さくすることによる収穫逓減と、通信のオーバーヘッドの増加が得られます。そのため、並列アプリケーションを作成する場合、オーバーヘッドを最小限に抑えるネットワーク構造を設計するだけでなく、何人のワーカーが最高のパフォーマンスを発揮するかを測定するために多くの作業を行う必要があります。

score 4 · Accepted Answer

投稿されたコードには、実際の正確性の問題がいくつかあります。ランク0からの送信ループを見てみましょう。

    for (i = 1; i < size; i++) {
        //...
        low_bound = (i - 1) * portion;
        upper_bound = low_bound + portion; 

        MPI_Isend(&low_bound, 1, MPI_INT, i, MASTER_TO_SLAVE_TAG, MPI_COMM_WORLD, &request);
        MPI_Isend(&upper_bound, 1, MPI_INT, i, MASTER_TO_SLAVE_TAG + 1, MPI_COMM_WORLD, &request);
        MPI_Isend(&mat_a[low_bound][0], (upper_bound - low_bound) * NUM_COLUMNS_A, MPI_DOUBLE, i, MASTER_TO_SLAVE_TAG + 2, MPI_COMM_WORLD, &request);
    }

あなたはそれをすることはできません。非ブロッキングリクエストを使用する場合は、最終的にリクエストを使用する必要がMPI_Wait()あります。MPI_Test()これにより、リクエストが完了したことをユーザー（およびMPIライブラリ）が認識できるようになります。リソースのリークを回避するためにこれを行う必要がありますが、さらに重要なのは、この場合、送信が行われたことを知る前に、繰り返し上書きlow_boundしていることです。upper_boundワーカータスクが取得しているデータを誰が知っていますか。さらに、毎回リクエストを上書きすることで、リソースのリークを完全に保証します。

これに対処する方法はいくつかあります。最も簡単なのは、上限と下限の単純な配列、および要求の配列を作成することです。

if (rank == 0) {
    makeAB();
    requests     = malloc(size*3*sizeof(MPI_Request));
    low_bounds   = malloc(size*sizeof(int));
    upper_bounds = malloc(size*sizeof(int));

    start_time = MPI_Wtime();
    for (i = 1; i < size; i++) {
        portion = (NUM_ROWS_A / (size - 1));
        low_bounds[i] = (i - 1) * portion;
        if (((i + 1) == size) && ((NUM_ROWS_A % (size - 1)) != 0)) {
            upper_bounds[i] = NUM_ROWS_A; 
        } else {
            upper_bounds[i] = low_bounds[i] + portion; 
        }

        MPI_Isend(&(low_bounds[i]), 1, MPI_INT, i, MASTER_TO_SLAVE_TAG, MPI_COMM_WORLD, &(requests[3*i]));
        MPI_Isend(&(upper_bounds[i]), 1, MPI_INT, i, MASTER_TO_SLAVE_TAG + 1, MPI_COMM_WORLD, &(requests[3*i+1]));
        MPI_Isend(&mat_a[low_bounds[i]][0], (upper_bounds[i] - low_bounds[i]) * NUM_COLUMNS_A, MPI_DOUBLE, i, MASTER_TO_SLAVE_TAG + 2, MPI_COMM_WORLD, &(requests[3*i+2]));
    }
    MPI_Waitall(3*(size-1), &(requests[3]), MPI_STATUS_IGNORE);
    free(requests);

これの良いところは、ランク0がこの情報を保存しているので、作業者は完了時に情報を送り返す必要がなく、ランク0は適切な場所に直接受け取ることができるということです。

    //...
    for (i = low_bound; i < upper_bound; i++) {
        for (j = 0; j < NUM_COLUMNS_B; j++) {
            for (k = 0; k < NUM_ROWS_B; k++) {
                mat_result[i][j] += (mat_a[i][k] * mat_b[k][j]);
            }
        }
    }
    MPI_Send(&mat_result[low_bound][0], (upper_bound - low_bound) * NUM_COLUMNS_B, MPI_DOUBLE, 0, SLAVE_TO_MASTER_TAG + 2, MPI_COMM_WORLD);

//...
if (rank == 0) {
    for (i = 1; i < size; i++) {
        MPI_Recv(&mat_result[low_bounds[i]][0], (upper_bounds[i] - low_bounds[i]) * NUM_COLUMNS_B, MPI_DOUBLE, i, SLAVE_TO_MASTER_TAG + 2, MPI_COMM_WORLD, &status);
    }

ただし、すべてのプロセッサに分散する必要のあるこれらの値の配列がある限り、MPI_Scatter操作を使用できます。これは、ループオーバーセンドよりも一般的に効率的です。

    for (i = 1; i < size; i++) {
        low_bounds[i] = ...
        upper_bounds[i] = ...
    }

    MPI_Scatter(low_bounds,   1, MPI_INT, &low_bound,   1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Scatter(upper_bounds, 1, MPI_INT, &upper_bound, 1, MPI_INT, 0, MPI_COMM_WORLD);

理想的には、スキャッターまたはそのバリアントを使用して、A配列も分散します。

MPI_Scatter()のような集合的な操作でMPI_Bcast()あり、次の問題につながります。元のコードには次のようなものがあります。

    //rank 0:
    for (i = 1; i < size; i++ ) {
        //...
        MPI_Isend();
        MPI_Isend();
        MPI_Isend();
    }

    MPI_Bcast();

    // other ranks:
    MPI_Bcast();

    MPI_Recv();
    MPI_Recv();
    MPI_Recv();

集合的およびポイントツーポイント通信のインターリーブは非常に危険であり、デッドロックにつながる可能性があります。ここでは不要です。BcastをScatterおよびRecv（）の後に移動する必要があります（現在は1つのrecvのみ）。これにより、ワーカータスクコードは次のようになります。

    MPI_Scatter(NULL, 1, MPI_INT, &low_bound,   1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Scatter(NULL, 1, MPI_INT, &upper_bound, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Recv(&mat_a[low_bound][0], (upper_bound - low_bound) * NUM_COLUMNS_A, MPI_DOUBLE, 0, MASTER_TO_SLAVE_TAG + 2, MPI_COMM_WORLD, &status);
    MPI_Bcast(&mat_b, NUM_ROWS_B*NUM_COLUMNS_B, MPI_DOUBLE, 0, MPI_COMM_WORLD);

これでほとんどの正確性の問題が解消されますが、スキャッターを使用してA配列を分散し、ランク0を使用して、ワーカータスクが何かを実行するのを待っている間に計算の「公平な共有」を実行することをお勧めします。（これには、プログラムがsize = 1で動作することを意味するという利点があります）。それでは、パフォーマンスの問題を見てみましょう。

問題のサイズを修正するには、プログラムで次のことを行う必要があります。

マトリックスの上限と下限を分散する（2つの短いメッセージまたは2つの集合）
A行列を配布します（（サイズ-1）長いメッセージ、サイズN ^ 2 /（サイズ-1）が2倍になります）
Bマトリックスをブロードキャストします（集合を使用してすべてのタスクにN ^ 2 doubleを送信します）
Aマトリックスを取得します（Aマトリックスを送信するのと同じです）

そして、各タスクはする必要があります

行列積を計算します（N ^ 3 /（サイズ-1）演算）。

各ランクで実行する必要のある実際の計算作業量は、実行中のプロセッサの数が1 /（P-1）になると実際には減少しますが、通信作業量は（Pまたはlg P、依存）。ある時点で、それらがクロスオーバーしてより多くのプロセッサで実行されると、処理速度が低下します。では、そのポイントはどこにありますか？

単一の8コアnehalemノードでクイックスケーリングテストを実行し、IPMを使用して、時間が費やされている場所の単純なカウントを取得すると、次のようになります。

worker  |  running  |          |  MPI
tasks   |    time   |  Speedup |  time
--------+-----------+----------+--------
   1    |  90.85s   |     -    |  45.5s   
   2    |  45.75s   |   1.99x  |  15.4s
   4    |  23.42s   |   3.88x  |  4.93s
   6    |  15.75s   |   5.76x  |  2.51s

This is actually not too bad; the MPI time is actually being spent almost entirely in the MPI_Recv(), which on-node represents the cost of copying the matrix pieces and, for the rank 0 process, waiting for the results to start coming back from the worker tasks. This suggests that having rank 0 do some work, and replacing the linear loop over receives with a gather operation, would be useful optimizations.

Naturally, as you go off-node or to larger number of processors, the communications cost will continue to go up, and scaling will deteriorate.

More minor points:

まず、マスタースレーブは一般に、行列の乗算などの単純な負荷分散を使用して、密結合の数値問題に取り組むにはかなり貧弱な方法です。しかし、これは単なる学習MPIの演習であると想定し、そのままにしておきます。もちろん、MPIベースの行列乗算を行う正しい方法は、SCALAPACK、 Eigenなどの既存の行列乗算ライブラリを使用することです。

第二に、グローバル変数を多用することは一般的に非常に役に立たないが、それはこの質問の範囲を超えている。また、それは必然であり、両方NUM_COLUMNS_Aは必要NUM_ROWS_Bありません。

score 1 · Accepted Answer

プロセスを分離するときは、結果を送信するのにかかる時間と得られる節約のバランスをとる必要があります。あなたの場合、送信する計算は、ローカルで計算するよりも送信に時間がかかると思います。

作業のより大きなチャンクを他のプロセスと共有してみてください。

c - プロセスの数を増やしても、この並列コードの実行時間が短縮されないのはなぜですか？

3 に答える 3

Related

Reference