io - Reading from a text file with MPI

Question

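I am learning to program in MPI and I came across this question. Let's say I have a .txt file with 100,000 rows/lines; how do I chunk them for processing by 4 processors? That is, I want processor 0 to take care of lines 0-25000, processor 1 to take care of 25001-50000, and so on. I did some searching and came across MPI_File_seek, but I am not sure whether it works on .txt files and supports fscanf afterwards.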

score 21 · Accepted Answer

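Text isn't a great format for parallel processing, precisely because you don't know ahead of time where (say) line 25001 begins. So problems like this are usually handled up front with a preprocessing step: either build an index of line offsets, or split the file into one chunk per process.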
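
As a rough illustration of the indexing option, here is a minimal sketch that records where each line starts in a text buffer. The names build_line_index and first_line_for_rank are made up for this example, and it works on an in-memory buffer for brevity; a real preprocessing pass would stream the file and store the offsets somewhere the MPI job can read them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: record the byte offset at which each line of a
 * text buffer starts. Returns a malloc'd array of *nlines offsets
 * (caller frees), or NULL if allocation fails. */
size_t *build_line_index(const char *buf, size_t len, size_t *nlines) {
    assert(buf != NULL && nlines != NULL);
    size_t cap = 1024, n = 0;
    size_t *offsets = malloc(cap * sizeof *offsets);
    if (!offsets) return NULL;

    int at_line_start = 1;
    for (size_t pos = 0; pos < len; pos++) {
        if (at_line_start) {
            if (n == cap) {               /* grow the array when full */
                size_t *tmp = realloc(offsets, 2 * cap * sizeof *tmp);
                if (!tmp) { free(offsets); return NULL; }
                offsets = tmp;
                cap *= 2;
            }
            offsets[n++] = pos;
            at_line_start = 0;
        }
        if (buf[pos] == '\n') at_line_start = 1;
    }
    *nlines = n;
    return offsets;
}

/* Hypothetical helper: index of the first line owned by `rank` when
 * nlines lines are split as evenly as possible across `size` ranks.
 * Passing rank == size yields nlines, i.e. one past the last line. */
size_t first_line_for_rank(size_t nlines, int rank, int size) {
    return (size_t)rank * nlines / (size_t)size;
}
```

With such an index, rank r could read the bytes from offsets[first_line_for_rank(n, r, size)] up to the start of the next rank's first line, so no overlap or boundary search is needed.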

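If you really want to do it through MPI, I'd suggest using MPI-IO to read overlapping chunks of the text file onto the various processors, with the overlap much longer than the longest line you expect. The processors then agree on where each one starts; for example, you could say that the first (or last) newline in the overlap region shared by processes N and N+1 is where process N leaves off and N+1 starts.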
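
The boundary rule can be tried out without MPI. Below is a small serial sketch, using the "first newline in the overlap" convention, that computes where each rank's owned lines begin in an in-memory buffer. The function name owned_start is hypothetical, and it assumes every nominal split point is followed by a newline within `overlap` bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte offset where `rank` starts owning lines.
 * Rank 0 starts at 0, and the one-past-the-end rank `size` "starts" at
 * len. Any other rank starts just after the first newline at or after
 * its nominal split point rank * (len / size). Returns -1 if no newline
 * is found within `overlap` bytes, i.e. the overlap is too short. */
long owned_start(const char *buf, long len, int rank, int size, long overlap) {
    assert(buf != NULL && size > 0);
    if (rank <= 0) return 0;
    if (rank >= size) return len;
    long nominal = rank * (len / size);
    for (long i = nominal; i < len && i < nominal + overlap; i++)
        if (buf[i] == '\n') return i + 1;
    return -1;
}
```

Rank r then owns the half-open range [owned_start(r), owned_start(r+1)). Because rank r's end and rank r+1's start come from the same search beginning at the same offset, the ranges tile the data with no line lost or processed twice.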

To follow this up with some code:

#include <stdio.h>
#include <mpi.h>
#include <stdlib.h>
#include <ctype.h>
#include <string.h>
    
void parprocess(MPI_File *in, MPI_File *out, const int rank, const int size, const int overlap) {
    MPI_Offset globalstart;
    int mysize;
    char *chunk;
    
    /* read in relevant chunk of file into "chunk",
     * which starts at location in the file globalstart
     * and has size mysize 
     */
    {
        MPI_Offset globalend;
        MPI_Offset filesize;
    
        /* figure out who reads what */
        MPI_File_get_size(*in, &filesize);
        filesize--;  /* drop the final byte, assumed to be the trailing newline */
        mysize = filesize/size;
        globalstart = rank * mysize;
        globalend   = globalstart + mysize - 1;
        if (rank == size-1) globalend = filesize-1;
    
        /* add overlap to the end of everyone's chunk except last proc... */
        if (rank != size-1)
            globalend += overlap;
    
        mysize =  globalend - globalstart + 1;
    
        /* allocate memory */
        chunk = malloc( (mysize + 1)*sizeof(char));
    
        /* everyone reads in their part */
        MPI_File_read_at_all(*in, globalstart, chunk, mysize, MPI_CHAR, MPI_STATUS_IGNORE);
        chunk[mysize] = '\0';
    }
    
    
    /*
     * everyone calculate what their start and end *really* are by going 
     * from the first newline after start to the first newline after the
     * overlap region starts (eg, after end - overlap + 1)
     */
    
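    int locstart=0, locend=mysize-1;
    if (rank != 0) {
        /* skip the partial line at the start; the previous rank owns it */
        while (locstart < mysize-1 && chunk[locstart] != '\n') locstart++;
        locstart++;
    }
    if (rank != size-1) {
        /* search from the first byte of the overlap, which is the same
         * file offset the next rank starts its own search from */
        locend = locend - overlap + 1;
        while (locend < mysize-1 && chunk[locend] != '\n') locend++;
    }
    mysize = locend-locstart+1;
    if (mysize < 0) mysize = 0;  /* no complete line starts in this chunk */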
    
    /* "Process" our chunk by replacing non-space characters with '1' for
     * rank 0, '2' for rank 1, etc... 
     */
    
    for (int i=locstart; i<=locend; i++) {
        char c = chunk[i];
        /* cast: isspace() needs a value representable as unsigned char */
        chunk[i] = ( isspace((unsigned char)c) ? c : '1' + (char)rank );
    }

    
    /* output the processed file */
    
    MPI_File_write_at_all(*out, (MPI_Offset)(globalstart+(MPI_Offset)locstart), &(chunk[locstart]), mysize, MPI_CHAR, MPI_STATUS_IGNORE);
    
    return;
}
    
int main(int argc, char **argv) {
    
    MPI_File in, out;
    int rank, size;
    int ierr;
    const int overlap = 100;
    
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    
    if (argc != 3) {
        if (rank == 0) fprintf(stderr, "Usage: %s infilename outfilename\n", argv[0]);
        MPI_Finalize();
        exit(1);
    }
    
    ierr = MPI_File_open(MPI_COMM_WORLD, argv[1], MPI_MODE_RDONLY, MPI_INFO_NULL, &in);
    if (ierr) {
        if (rank == 0) fprintf(stderr, "%s: Couldn't open file %s\n", argv[0], argv[1]);
        MPI_Finalize();
        exit(2);
    }
    
    ierr = MPI_File_open(MPI_COMM_WORLD, argv[2], MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &out);
    if (ierr) {
        if (rank == 0) fprintf(stderr, "%s: Couldn't open output file %s\n", argv[0], argv[2]);
        MPI_Finalize();
        exit(3);
    }
    
    parprocess(&in, &out, rank, size, overlap);
    
    MPI_File_close(&in);
    MPI_File_close(&out);
    
    MPI_Finalize();
    return 0;
}

Running this on a narrowed version of the question text gives:

$ mpirun -n 3 ./textio foo.in foo.out
$ paste foo.in foo.out
Hi guys I am learning to            11 1111 1 11 11111111 11
program in MPI and I came           1111111 11 111 111 1 1111
across this question. Lets          111111 1111 111111111 1111
say I have a .txt file with         111 1 1111 1 1111 1111 1111
100,000 rows/lines, how do          1111111 11111111111 111 11
I chunk them for processing         1 11111 1111 111 1111111111
by 4 processors? i.e. I want        22 2 22222222222 2222 2 2222
to let processor 0 take care        22 222 222222222 2 2222 2222
of the processing for lines         22 222 2222222222 222 22222
0-25000, processor 1 to take        22222222 222222222 2 22 2222
care of 25001-50000 and so          2222 22 22222222222 222 22
on. I did some searching and        333 3 333 3333 333333333 333
did came across MPI_File_seek       333 3333 333333 3333333333333
but I am not sure can it work       333 3 33 333 3333 333 33 3333
on .txt and supports fscanf         33 3333 333 33333333 333333
afterwards.                         33333333333
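
On the fscanf part of the question: an MPI_File is not a stdio FILE *, so fscanf cannot read from it directly. Once a rank has its lines in memory, though, sscanf can parse them one line at a time. The sketch below assumes a made-up record format of "<int id> <double value>" per line; the name sum_values is hypothetical, and the caller must make sure buf[end+1] is addressable (true for the chunk buffer above, which is allocated with one extra byte).

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: parse each line in buf[start..end] (inclusive)
 * as "<int id> <double value>" and add up the values. Lines that do not
 * match are skipped. Returns the number of lines parsed successfully.
 * Each line is temporarily NUL-terminated so sscanf cannot run on into
 * the following line. */
int sum_values(char *buf, int start, int end, double *sum) {
    assert(buf != NULL && sum != NULL);
    int parsed = 0;
    *sum = 0.0;
    int i = start;
    while (i <= end) {
        int j = i;
        while (j <= end && buf[j] != '\n') j++;   /* j: newline or end+1 */
        char saved = buf[j];
        buf[j] = '\0';
        int id;
        double value;
        if (sscanf(&buf[i], "%d %lf", &id, &value) == 2) {
            *sum += value;
            parsed++;
        }
        buf[j] = saved;                           /* restore the buffer */
        i = j + 1;
    }
    return parsed;
}
```

In the program above this would be called as sum_values(chunk, locstart, locend, &total) in place of the character-replacement loop, followed by something like MPI_Reduce to combine the per-rank totals.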
