c++ - How do I write only parts of a C++ and MPI code to run in parallel?

Question

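I have written some code using the PETSc library, and now I want to change parts of it so that they run in parallel. Most of what I want to parallelize is the matrix initialization and the parts where I generate and compute a large number of values. The problem is that when I run the code on multiple cores, every part of the code is executed as many times as the number of cores I use.

This is a simple sample program with which I tested PETSc and MPI.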

#include <petscmat.h>
#include <ctime>
#include <iostream>
#include <string>
using namespace std;

int main(int argc, char** argv)
{
    time_t rawtime;
    time ( &rawtime );
    string sta = ctime (&rawtime);
    cout << "Solving began..." << endl;

PetscInitialize(&argc, &argv, 0, 0);

  Mat            A;        /* linear system matrix */
  PetscInt       i,j,Ii,J,Istart,Iend,m = 120000,n = 3,its;
  PetscErrorCode ierr;
  PetscBool      flg = PETSC_FALSE;
  PetscScalar    v;
#if defined(PETSC_USE_LOG)
  PetscLogStage  stage;
#endif

  /* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
         Compute the matrix and right-hand-side vector that define
         the linear system, Ax = b.
     - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
  /* 
     Create parallel matrix, specifying only its global dimensions.
     When using MatCreate(), the matrix format can be specified at
     runtime. Also, the parallel partitioning of the matrix is
     determined by PETSc at runtime.

     Performance tuning note:  For problems of substantial size,
     preallocation of matrix memory is crucial for attaining good 
     performance. See the matrix chapter of the users manual for details.
  */
  ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
  ierr = MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,m,n);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);
  ierr = MatMPIAIJSetPreallocation(A,5,PETSC_NULL,5,PETSC_NULL);CHKERRQ(ierr);
  ierr = MatSeqAIJSetPreallocation(A,5,PETSC_NULL);CHKERRQ(ierr);
  ierr = MatSetUp(A);CHKERRQ(ierr);

  /* 
     Currently, all PETSc parallel matrix formats are partitioned by
     contiguous chunks of rows across the processors.  Determine which
     rows of the matrix are locally owned. 
  */
  ierr = MatGetOwnershipRange(A,&Istart,&Iend);CHKERRQ(ierr);

  /* 
     Set matrix elements for the 2-D, five-point stencil in parallel.
      - Each processor needs to insert only elements that it owns
        locally (but any non-local elements will be sent to the
        appropriate processor during matrix assembly). 
      - Always specify global rows and columns of matrix entries.

     Note: this uses the less common natural ordering that orders first
     all the unknowns for x = h then for x = 2h etc; Hence you see J = Ii +- n
     instead of J = I +- m as you might expect. The more standard ordering
     would first do all variables for y = h, then y = 2h etc.

   */
PetscMPIInt    rank;        // processor rank
PetscMPIInt    size;        // size of communicator
MPI_Comm_rank(PETSC_COMM_WORLD,&rank);
MPI_Comm_size(PETSC_COMM_WORLD,&size);

cout << "Rank = " << rank << endl;
cout << "Size = " << size << endl;

cout << "Generating 2D-Array" << endl;

double temp2D[120000][3];
 for (Ii=Istart; Ii<Iend; Ii++) { 
    for(J=0; J<n;J++){
      temp2D[Ii][J] = 1;
    }
  }
  cout << "Processor " << rank << " set values : " << Istart << " - " << Iend << " into 2D-Array" << endl;

  v = -1.0;
  for (Ii=Istart; Ii<Iend; Ii++) { 
    for(J=0; J<n;J++){
       ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);
   }
  }
  cout << "Ii = " << Ii << " processor " << rank << " and it owns: " << Istart << " - " << Iend << endl;

  /* 
     Assemble matrix, using the 2-step process:
       MatAssemblyBegin(), MatAssemblyEnd()
     Computations can be done while messages are in transition
     by placing code between these two statements.
  */
  ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

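  /* Free the matrix, then let PETSc shut down MPI (it was started by PetscInitialize). */
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();CHKERRQ(ierr);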
cout << "No more MPI" << endl;
return 0;

}

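My real program consists of several different .cpp files. I initialize MPI in the main program and then call a function in another .cpp file. In that file I implemented the same kind of matrix filling, but every cout that the program executes before filling the matrix is printed as many times as there are cores.

I can run my test program as mpiexec -n 4 test and it runs fine, but for some reason I have to run my real program as mpiexec -n 4 ./myprog

The output of my test program is as follows: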

Solving began...
Solving began...
Solving began...
Solving began...
Rank = 0
Size = 4
Generating 2D-Array
Processor 0 set values : 0 - 30000 into 2D-Array
Rank = 2
Size = 4
Generating 2D-Array
Processor 2 set values : 60000 - 90000 into 2D-Array
Rank = 3
Size = 4
Generating 2D-Array
Processor 3 set values : 90000 - 120000 into 2D-Array
Rank = 1
Size = 4
Generating 2D-Array
Processor 1 set values : 30000 - 60000 into 2D-Array
Ii = 30000 processor 0 and it owns: 0 - 30000
Ii = 90000 processor 2 and it owns: 60000 - 90000
Ii = 120000 processor 3 and it owns: 90000 - 120000
Ii = 60000 processor 1 and it owns: 30000 - 60000
no more MPI
no more MPI
no more MPI
no more MPI
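
The row ranges in this output are what an even split of 120000 rows over 4 processes gives. A rough sketch of how such a contiguous block partition can be computed is shown below. The function name `ownership_range` is hypothetical, and the rule that the lowest ranks absorb any remainder is an assumption of this sketch; in the real program the authoritative values are the ones returned by MatGetOwnershipRange.

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper: half-open row range [start, end) owned by `rank`
// when N rows are split into contiguous blocks over `size` processes.
// Assumption: the first (N % size) ranks each take one extra row.
std::pair<long, long> ownership_range(long N, int size, int rank) {
    assert(N >= 0 && size > 0 && rank >= 0 && rank < size);
    const long base  = N / size;  // rows that every rank gets
    const long extra = N % size;  // leftover rows, handed to the lowest ranks
    const long start = rank * base + (rank < extra ? rank : extra);
    const long count = base + (rank < extra ? 1 : 0);
    return {start, start + count};
}
```

For N = 120000 and size = 4, rank 2 gets the range 60000 to 90000 (end exclusive), which matches the line printed by processor 2 above.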

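Edit after two comments: My goal is to run this on a small cluster that has 20 nodes with 2 cores each. Later this has to run on a supercomputer, so MPI is definitely the way I need to go. At the moment I am testing this on two different machines: one has 1 processor / 4 cores and the other has 4 processors / 16 cores.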

score 5 · Accepted Answer

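MPI is an implementation of the SPMD/MPMD model (single program multiple data / multiple programs multiple data). An MPI job consists of concurrently running processes that exchange messages with one another in order to cooperate on solving the problem. You cannot have only parts of the code run in parallel. You can only have parts of the code that do not communicate with one another but are still executed concurrently by every process. And you have to launch the application in parallel mode using mpirun or mpiexec.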

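If you want to parallelize only parts of your code and can live with the restriction of running on a single machine, then what you need is OpenMP, not MPI. Alternatively, you can use low-level POSIX threads programming, since according to the PETSc web site it supports pthreads. And because OpenMP is built on top of pthreads, it might be possible to use PETSc with OpenMP.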
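
As a minimal sketch of what "only part of the code in parallel" looks like with OpenMP, the function below fills a table like the 120000 x 3 array from the question. The name `fill_table` and the flat row-major layout are assumptions made for this illustration, and the example touches only a plain array; whether PETSc calls can safely be placed inside such a loop is not settled here. With GCC the pragma takes effect when compiling with -fopenmp; without that flag it is ignored and the loop simply runs serially with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: fills a rows-by-cols table stored row-major in a
// flat vector. Only this one loop is parallel; the caller stays serial.
std::vector<double> fill_table(std::size_t rows, std::size_t cols, double value) {
    assert(cols > 0);  // a table without columns is treated as a caller error here
    std::vector<double> table(rows * cols);
    // Each iteration writes a different row, so the iterations are independent
    // and can be divided among threads.
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(rows); ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            table[static_cast<std::size_t>(i) * cols + j] = value;
        }
    }
    return table;
}
```

Unlike the MPI version, everything before and after the loop runs exactly once, because there is a single process and the extra threads exist only inside the parallel region.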

score 1 · Accepted Answer

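To add to Hristo's answer: MPI is built to run in a distributed fashion, that is, as completely separate processes. They have to be separate because they are assumed to be on different physical machines. You can run several MPI processes on one machine (for example, one per core). That is perfectly fine, but MPI has no tools to take advantage of that shared-memory context. In other words, you cannot have some MPI ranks (processes) work on a matrix owned by another MPI process, because there is no way to share the matrix.

When you start x MPI processes, x copies of exactly the same program run. You need code like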

if (rank == 0)
    do something
else
    do something else

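to make different processes do different things. The processes can communicate with one another by sending messages, but they all run the exact same binary. If your code does not branch on the rank, you simply get x copies of the same program producing the same result x times.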
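
The effect of such a rank check on output can be sketched without MPI at all. The example below is an emulation, not real MPI: `run_copy` and `emulate_job` are hypothetical names, and the loop over ranks merely stands in for mpiexec starting several copies of the program. In a real program, `rank` and `size` would come from MPI_Comm_rank and MPI_Comm_size.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical stand-in for what one copy of the program prints.
std::string run_copy(int rank, int size) {
    std::ostringstream out;
    if (rank == 0) {
        // Guarded: only the copy with rank 0 executes this.
        out << "Solving began on " << size << " processes\n";
    }
    // Unguarded: every copy executes this.
    out << "Rank " << rank << " fills its own rows\n";
    return out.str();
}

// Emulates a job of `size` processes by running the body once per rank,
// in rank order. Real MPI processes run concurrently, so the order of
// their output lines is not fixed.
std::string emulate_job(int size) {
    assert(size > 0);
    std::string all;
    for (int rank = 0; rank < size; ++rank) {
        all += run_copy(rank, size);
    }
    return all;
}
```

Applied to the question's code, this means that a message such as "Solving began..." is printed once only if it comes after the rank is known and sits under a check like `if (rank == 0)`. PETSc also offers PetscPrintf, which prints only from the first process of the communicator it is given.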
