c++ - Speeding up reading integers from a file in C++

Question

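I'm reading a file line by line and extracting integers from it. Some noteworthy points:

The input file is not in binary format.
I can't load the whole file into memory.
File format (only integers, separated by some delimiter):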
```
x1 x2 x3 x4 ...
y1 y2 y3 ...
z1 z2 z3 z4 z5 ...
...
```

To add context, I'm reading the integers and counting them using a std::unordered_map<unsigned int, unsigned int>.

Simply looping over the lines and allocating a useless stringstream, like this:

std::fstream infile(<inpath>, std::ios::in);
while (std::getline(infile, line)) {
    std::stringstream ss(line);
}

takes about 2.7 s for a 700 MB file.

Parsing each line:

unsigned int item;
std::fstream infile(<inpath>, std::ios::in);
while (std::getline(infile, line)) {
    std::stringstream ss(line);
    while (ss >> item);
}

takes ~17.8 s for the same file.

Changing operator>> to std::getline + atoi:

unsigned int item;
std::fstream infile(<inpath>, std::ios::in);
while (std::getline(infile, line)) {
    std::stringstream ss(line);
    while (std::getline(ss, token, ' ')) item = atoi(token.c_str());
}

gives ~14.6 s.

Is there anything faster than these approaches? I don't think I need to speed up the file reading, just the parsing itself, although speeding up both wouldn't hurt (:
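
A further variant in the same spirit, which has not been timed against the file above, is to walk each line's buffer directly with std::from_chars (C++17) instead of building a std::stringstream per line. This is only a minimal sketch: the helper name parse_line is hypothetical, and it assumes non-negative integers separated by spaces or tabs.

```cpp
#include <charconv>
#include <string>
#include <system_error>
#include <vector>

// Hypothetical helper: extracts all unsigned integers from one line.
// Assumes tokens are separated by spaces, tabs or '\r'.
// Stops at the first token that is not a valid unsigned int.
std::vector<unsigned int> parse_line(const std::string& line) {
    std::vector<unsigned int> items;
    const char* p = line.data();
    const char* end = p + line.size();
    while (p != end) {
        // Skip delimiters between numbers.
        if (*p == ' ' || *p == '\t' || *p == '\r') { ++p; continue; }
        unsigned int value = 0;
        auto [next, ec] = std::from_chars(p, end, value);
        if (ec != std::errc()) break;  // malformed or out-of-range token
        items.push_back(value);
        p = next;  // continue right after the parsed digits
    }
    return items;
}
```

It would be called once per line inside the std::getline loop, in place of the stringstream.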

score 9 · Accepted Answer

This program

#include <iostream>
int main ()
{
    int num;
    while (std::cin >> num) ;
}

takes about 17 seconds to read the file. This code

#include <iostream>   
int main()
{
    int lc = 0;
    int item = 0;
    char buf[2048];
    do
    {
        std::cin.read(buf, sizeof(buf));
        int k = std::cin.gcount();
        for (int i = 0; i < k; ++i)
        {
            switch (buf[i])
            {
                case '\r':
                    break;
                case '\n':
                    item = 0; lc++;
                    break;
                case ' ':
                    item = 0;
                    break;
                case '0': case '1': case '2': case '3':
                case '4': case '5': case '6': case '7':
                case '8': case '9':
                    item = 10*item + buf[i] - '0';
                    break;
                default:
                    std::cerr << "Bad format\n";
            }    
        }
    } while (std::cin);
}

needs 1.25 seconds for the same file. Make of it what you will...

score 2 · Accepted Answer

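Streams are slow. If you really want it to be fast, load the whole file into memory and parse it in memory. If you really can't load it all into memory, load it in chunks, make those chunks as large as possible, and parse each chunk in memory.

When parsing in memory, replace the spaces and line endings with nulls so that you can use atoi to convert to integers.

Oh, and you'll run into problems at the end of a chunk, because you don't know whether the chunk boundary cuts a number in two. To solve this easily, stop a short distance before the end of the chunk (16 bytes should do), copy this tail to the start of the buffer, and then load the next chunk after it.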

score 1 · Accepted Answer

Have you tried input iterators?

They skip the creation of the strings.

std::istream_iterator<int> begin(infile);
std::istream_iterator<int> end;
int item = 0;
while(begin != end)
    item = *begin++;

score 1 · Accepted Answer

Following up on Jack Aidley's answer (I can't put code in comments), here is some pseudocode:

vector<char> buff( chunk_size );
roffset = 0; // number of unprocessed bytes carried over at the front of the buffer
char* chunk = &buff[0];
while( not done with file )
{
    fread( chunk + roffset, ... ); // Read a sizable chunk into memory, filling in after roffset
    last_eol = find_last_eol( chunk ); // find where the last full line ends
    parse_in_mem( chunk, last_eol ); // process up to the last full line
    roffset = move_unprocessed_to_front( chunk, last_eol ); // don't re-read what's already in mem
}

score 1 · Accepted Answer

Why not skip the stringstream and the line buffer and read directly from the file stream?

#include <istream>
#include <vector>

template<class T, class CharT, class CharTraits>
std::vector<T> read(std::basic_istream<CharT, CharTraits> &in) {
    std::vector<T> ret;
    T x;
    // Test the extraction itself, so a final value that ends exactly at EOF is kept.
    while(in >> x) ret.push_back(x);
    return ret;
}

http://ideone.com/FNJKFa
