c++ - Fast, simple CSV parsing in C++

Question

I'm trying to parse a simple CSV file that contains data in a format like this:

20.5,20.5,20.5,0.794145,4.05286,0.792519,1
20.5,30.5,20.5,0.753669,3.91888,0.749897,1
20.5,40.5,20.5,0.701055,3.80348,0.695326,1

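So it's a file in a very simple, fixed format. I'm storing each column of this data in an STL vector. As such, I've tried to stay the C++ way by using the standard library, and my implementation inside a loop looks something like this: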

string field;
getline(file,line);
stringstream ssline(line);

getline( ssline, field, ',' );
stringstream fs1(field);
fs1 >> cent_x.at(n);

getline( ssline, field, ',' );
stringstream fs2(field);
fs2 >> cent_y.at(n);

getline( ssline, field, ',' );
stringstream fs3(field);
fs3 >> cent_z.at(n);

getline( ssline, field, ',' );
stringstream fs4(field);
fs4 >> u.at(n);

getline( ssline, field, ',' );
stringstream fs5(field);
fs5 >> v.at(n);

getline( ssline, field, ',' );
stringstream fs6(field);
fs6 >> w.at(n);

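The problem is that this is very slow (there are over a million lines per data file), and it seems a bit inelegant to me. Is there a faster approach using the standard library, or should I just use the stdio functions? It seems to me that this whole block of code could be reduced to a single fscanf call.

Thanks in advance!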

score 9 · Accepted Answer

Using seven stringstreams when you can do it with just one certainly doesn't help with performance. Try this instead:

string line;
getline(file, line);

istringstream ss(line);  // note we use istringstream, we don't need the o part of stringstream

char c1, c2, c3, c4, c5;  // to eat the commas

ss >> cent_x.at(n) >> c1 >>
      cent_y.at(n) >> c2 >>
      cent_z.at(n) >> c3 >>
      u.at(n) >> c4 >>
      v.at(n) >> c5 >>
      w.at(n);

If you know the number of lines in the file, you can resize the vectors before reading and then use operator[] instead of at(). That way you avoid the bounds check and gain a little performance.
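For illustration, here is a minimal sketch of that resize-then-operator[] idea. The `Columns` struct and the `read_columns` function are hypothetical names introduced only for this example, and the line count `n_lines` is assumed to be known before reading. As in the snippet above, the trailing seventh column is simply ignored.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for the six columns used in the question.
struct Columns {
    std::vector<double> cent_x, cent_y, cent_z, u, v, w;
};

// Reads up to n_lines rows from `in` (n_lines is assumed to be known up front).
// Returns the number of rows that parsed completely.
std::size_t read_columns(std::istream& in, std::size_t n_lines, Columns& cols) {
    // Size every vector once so the loop can use operator[] (no bounds check).
    cols.cent_x.resize(n_lines); cols.cent_y.resize(n_lines); cols.cent_z.resize(n_lines);
    cols.u.resize(n_lines);      cols.v.resize(n_lines);      cols.w.resize(n_lines);

    std::string line;
    std::size_t n = 0;
    while (n < n_lines && std::getline(in, line)) {
        std::istringstream ss(line);
        char c1, c2, c3, c4, c5;  // to eat the commas
        if (ss >> cols.cent_x[n] >> c1 >> cols.cent_y[n] >> c2 >> cols.cent_z[n] >> c3
               >> cols.u[n] >> c4 >> cols.v[n] >> c5 >> cols.w[n]) {
            ++n;  // keep the row only if all six fields were read
        }
    }

    // Drop the unused tail if the file had fewer valid rows than expected.
    cols.cent_x.resize(n); cols.cent_y.resize(n); cols.cent_z.resize(n);
    cols.u.resize(n);      cols.v.resize(n);      cols.w.resize(n);
    return n;
}
```

You would call it with an open `std::ifstream` and the expected row count. Whether skipping the bounds check makes a measurable difference depends on your compiler and data, so it is worth timing both versions.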

score 2 · Answer

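I believe the main bottleneck (aside from the getline()-based unbuffered I/O) is the string parsing. Since you are using the ',' character as the delimiter, you can do a linear scan over the string and replace every ',' with '\0' (the end-of-string marker, the zero terminator).

Something like this: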

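// tmp array for the line part values
const int MAX_PARTS = 7;  // assumed: one slot per column in the sample data
double parts[MAX_PARTS];

while(getline(file, line))
{
    if(line.empty()) { continue; }

    size_t len = line.length();
    const char* last_start = line.c_str();
    int num_parts = 0;

    // scan up to and including len so the last field (which has no trailing ',') is parsed too
    for(size_t j = 0; j <= len && num_parts < MAX_PARTS; j++)
    {
        // a field ends at each ',' and at the end of the line
        if(j == len || line[j] == ',')
        {
           if(j < len) { line[j] = '\0'; }  // terminate the field in place

           parts[num_parts] = atof(last_start);
           num_parts++;
           last_start = line.c_str() + j + 1;  // next field starts right after the delimiter
        }
    }

    /// do whatever you need with the first num_parts entries of the parts[] array
}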
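As a variation on the same linear scan (not what the code above does), std::strtod reports where it stopped parsing, so you can walk the line without writing '\0' into it at all. This is only a sketch: `parse_line` is a hypothetical helper name, and it assumes the default "C" locale so that '.' is the decimal point.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Parses one comma-separated line of numbers into `out` and returns the
// number of fields converted. Stops at the first field that is not a number.
std::size_t parse_line(const std::string& line, std::vector<double>& out) {
    out.clear();
    const char* p = line.c_str();
    while (*p != '\0') {
        char* end = nullptr;
        double value = std::strtod(p, &end);
        if (end == p) { break; }  // nothing was converted: malformed or empty field
        out.push_back(value);
        p = end;
        if (*p == ',') { ++p; }   // step over the delimiter
        else { break; }           // end of line or an unexpected character
    }
    return out.size();
}
```

Reusing the same `out` vector for every line avoids reallocating it on each iteration.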

score 2 · Answer

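I don't know whether this is faster than the accepted answer, but I may as well post it in case you want to try it. You can load the entire contents of the file with a single read call by using some fseek magic to find out the size of the file. That is usually much faster than making multiple read calls.

You can then parse the string with something like this: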
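The fseek part could look roughly like the sketch below. `read_whole_file` is a hypothetical helper name. It opens the file in binary mode, which means any "\r\n" line endings are kept as they are, and it relies on fseek to SEEK_END working for binary streams, which holds on the mainstream platforms I know of but is not strictly guaranteed by the C standard.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Loads the whole file at `path` into `out` with one fread call.
// Returns false if the file cannot be opened, sized, or read completely.
bool read_whole_file(const char* path, std::string& out) {
    std::FILE* f = std::fopen(path, "rb");  // binary mode so ftell yields a byte count
    if (!f) { return false; }

    bool ok = false;
    if (std::fseek(f, 0, SEEK_END) == 0) {
        long size = std::ftell(f);  // file size in bytes, or -1 on error
        if (size >= 0 && std::fseek(f, 0, SEEK_SET) == 0) {
            out.resize(static_cast<std::size_t>(size));
            std::size_t got = 0;
            if (size > 0) {
                got = std::fread(&out[0], 1, out.size(), f);  // the single read call
            }
            ok = (got == out.size());
        }
    }
    std::fclose(f);
    return ok;
}
```

The resulting string can be used as the raw file data that gets split into lines.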

//Delimited string to vector
vector<string> dstov(const string& str, const string& delimiter)
{
  //Vector to populate
  vector<string> ret;
  //Current position in str
  size_t pos = 0;
  //Position of the next delimiter at or after pos
  size_t found;
  //Search from pos directly instead of copying the rest of str with substr on every pass
  while((found = str.find(delimiter, pos)) != string::npos)
  {
    //Insert the substring from pos to the start of the found delimiter to the vector
    ret.push_back(str.substr(pos, found - pos));
    //Move the pos past this found section and the found delimiter so the search can continue
    pos = found + delimiter.size();
  }
  //Push back the final element in str when str contains no more delimiters
  ret.push_back(str.substr(pos));
  return ret;
}

//Assumed to already hold the whole file's contents from the single read described above
string rawfiledata;

//This call will parse the raw data into a vector containing lines of
//20.5,30.5,20.5,0.753669,3.91888,0.749897,1 by treating the newline
//as the delimiter
vector<string> lines = dstov(rawfiledata, "\n");

//You can then iterate over the lines and parse them into variables and do whatever you need with them.
for(size_t itr = 0; itr < lines.size(); ++itr)
{
  vector<string> line_variables = dstov(lines[itr], ",");
  //Use line_variables here; it goes out of scope at the end of each iteration
}
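To get numbers out of each line_variables vector you still need a string-to-double conversion. One possible sketch uses std::stod from C++11. `to_doubles` is a hypothetical helper name, and the code assumes every field is a valid number.

```cpp
#include <string>
#include <vector>

// Converts already-split fields (e.g. one line_variables vector) to doubles.
// std::stod throws std::invalid_argument for a non-numeric or empty field and
// std::out_of_range for a value that does not fit in a double.
std::vector<double> to_doubles(const std::vector<std::string>& fields) {
    std::vector<double> values;
    values.reserve(fields.size());  // one allocation per line
    for (const std::string& field : fields) {
        values.push_back(std::stod(field));
    }
    return values;
}
```

Note that if the file ends with a newline, splitting on "\n" leaves one empty string at the end of lines, so you would want to skip empty lines before converting.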
