c++ - Reading integers from a memory-mapped formatted file

Question

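I have memory-mapped a large formatted (text) file that contains one integer per line.

So I have a pointer to the first byte of the memory and also a pointer to the last byte. I am trying to read all of those integers into an array as fast as possible. Initially I created a specialized std::streambuf class so that a std::istream could read from that memory, but it seems to be relatively slow.

Do you have any suggestions for efficiently parsing a string like "1231232\r\n123123\r\n123\r\n1231\r\n2387897..." into an array {1231232,123123,123,1231,2387897,...}?

The number of integers in the file is not known beforehand.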

score 1 · Accepted Answer

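This was a really interesting task for learning a bit more about C++.

Admittedly, the code is quite big and has a lot of error checking, but that only shows how many different things can go wrong during parsing.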

#include <ctype.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>   // exit, EXIT_FAILURE

#include <algorithm>  // std::for_each
#include <iterator>
#include <vector>
#include <string>

static void
die(const char *reason)
{
  fprintf(stderr, "aborted (%s)\n", reason);
  exit(EXIT_FAILURE);
}

template <class BytePtr>
static bool
read_uint(BytePtr *begin_ref, BytePtr end, unsigned int *out)
{
  const unsigned int MAX_DIV = UINT_MAX / 10;
  const unsigned int MAX_MOD = UINT_MAX % 10;

  BytePtr begin = *begin_ref;
  unsigned int n = 0;

  while (begin != end && '0' <= *begin && *begin <= '9') {
    unsigned digit = *begin - '0';
    if (n > MAX_DIV || (n == MAX_DIV && digit > MAX_MOD))
      die("unsigned overflow");
    n = 10 * n + digit;
    begin++;
  }

  if (begin == *begin_ref)
    return false;

  *begin_ref = begin;
  *out = n;
  return true;
}

template <class BytePtr, class IntConsumer>
void
parse_ints(BytePtr begin, BytePtr end, IntConsumer out)
{
  while (true) {
    while (begin != end && *begin == (unsigned char) *begin && isspace(*begin))
      begin++;
    if (begin == end)
      return;

    bool negative = *begin == '-';
    if (negative) {
      begin++;
      if (begin == end)
        die("minus at end of input");
    }

    unsigned int un;
    if (!read_uint(&begin, end, &un))
      die("no number found");

    if (!negative && un > INT_MAX)
      die("too large positive");
    if (negative && un > -((unsigned int)INT_MIN))
      die("too small negative");

    int n = negative ? -un : un;
    *out++ = n;
  }
}

static void
print(int x)
{
  printf("%d\n", x);
}

int
main()
{
  std::vector<int> result;
  std::string input("2147483647 -2147483648 0 00000 1 2 32767 4 -17 6");

  parse_ints(input.begin(), input.end(), back_inserter(result));

  std::for_each(result.begin(), result.end(), print);
  return 0;
}

I tried hard not to invoke any undefined behavior, which can get quite tricky when converting unsigned numbers to signed ones or when calling isspace on an unknown data type.
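The two pitfalls named above can be isolated in small helpers. The following is only a sketch: `is_space_byte` and `to_signed` are hypothetical names that do not appear in the answer's code, and the approach (handling INT_MIN as a special case) is one of several ways to avoid the out-of-range conversion.

```cpp
#include <cctype>
#include <climits>

// Hypothetical helper: classify a byte as whitespace without passing a
// negative value to isspace (which would be undefined behavior when plain
// char is signed and the byte is >= 0x80).
inline bool is_space_byte(char c)
{
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}

// Hypothetical helper: combine a sign flag and an unsigned magnitude into an
// int. Returns false instead of overflowing when the value does not fit.
inline bool to_signed(bool negative, unsigned int magnitude, int *out)
{
    if (!negative) {
        if (magnitude > static_cast<unsigned int>(INT_MAX))
            return false;
        *out = static_cast<int>(magnitude);
        return true;
    }
    // |INT_MIN| computed in unsigned arithmetic, which wraps and is well defined.
    const unsigned int min_magnitude = 0u - static_cast<unsigned int>(INT_MIN);
    if (magnitude > min_magnitude)
        return false;
    // Never convert an out-of-range unsigned value to int: INT_MIN is handled
    // separately, and every other magnitude fits in int before negation.
    *out = (magnitude == min_magnitude) ? INT_MIN
                                        : -static_cast<int>(magnitude);
    return true;
}
```

With helpers like these, the parsing loop only has to collect the sign and the digits; every range decision sits in one place.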

score 0 · Accepted Answer

Note: this answer has been edited several times.

Read the memory line by line (based on link and link).

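#include <algorithm>
#include <istream>
#include <iterator>
#include <streambuf>
#include <string>
#include <vector>
#include <boost/lexical_cast.hpp>

class line
{
   std::string data;
public:
   friend std::istream &operator>>(std::istream &is, line &l)
   {
      std::getline(is, l.data);
      if (!l.data.empty() && l.data[l.data.size() - 1] == '\r')
         l.data.erase(l.data.size() - 1);   // drop the CR of a "\r\n" line ending
      return is;
   }
   // const, because istream_iterator hands out const references
   operator std::string() const { return data; }
};

// std::streambuf cannot be instantiated directly and setg is protected,
// so the mapped range is wrapped in a small derived class.
struct membuf : std::streambuf
{
   membuf(char *first, char *last) { setg(first, first, last); }
};

// ptr: start of the mapped block, size: its length in bytes
membuf osrb(ptr, ptr + size);
std::istream istr(&osrb);

std::vector<int> ints;

std::istream_iterator<line> begin(istr);
std::istream_iterator<line> end;
std::transform(begin, end, std::back_inserter(ints), &boost::lexical_cast<int, std::string>);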
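For comparison, the same stream-over-memory idea can be sketched without Boost by letting `operator>>` parse the integers directly. This is an illustration under assumptions, not the answer's code: `membuf` and `read_ints_via_stream` are hypothetical names, and the question already reports that a stream-based reader was relatively slow, so this shows the mechanics rather than a fast path.

```cpp
#include <cstddef>
#include <istream>
#include <iterator>
#include <streambuf>
#include <vector>

// Hypothetical read-only stream buffer over an existing byte range.
// The get area points straight at the caller's memory; nothing is copied.
struct membuf : std::streambuf
{
    membuf(const char *first, std::size_t size)
    {
        // setg wants non-const pointers, but a buffer that is only read
        // from never writes through them.
        char *p = const_cast<char *>(first);
        setg(p, p, p + size);
    }
};

// Reads every whitespace-separated int from [data, data + size).
// Formatted extraction skips '\r' and '\n' on its own.
inline std::vector<int> read_ints_via_stream(const char *data, std::size_t size)
{
    membuf buf(data, size);
    std::istream in(&buf);
    return std::vector<int>(std::istream_iterator<int>(in),
                            std::istream_iterator<int>());
}
```

Extraction stops at the first token that is not a valid int, so malformed input truncates the result silently rather than raising an error.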

score 0 · Accepted Answer

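Since the input is memory-mapped, it would be very efficient to simply copy the chars into a stack array and atoi them into another integer array that sits on top of a second memory-mapped file. This way the paging file is not used for these large buffers at all.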

open memory mapped file to output int buffer

declare small stack buffer of 20 chars
while not end of char array
  while current char not line feed
    copy char to stack buffer
  end while
  null terminate the stack buffer over the trailing carriage return
  atoi the stack buffer and copy the result to the output buffer
  increment the output buffer pointer
  skip the line feed
end while

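This doesn't use a library, but it has the advantage of keeping memory usage down to the memory-mapped files, so the temporary buffers are limited to the one on the stack and whatever atoi uses internally. The output buffer can be thrown away or left saved to the file, as needed.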
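The outline can be sketched in C++ roughly as follows. To keep the example self-contained, a `std::vector<int>` stands in for the second memory-mapped output buffer, and `atoi_lines` is a hypothetical name; both are assumptions, not part of the answer. Note that atoi reports no errors and has undefined behavior for values outside the range of int, so this assumes well-formed input.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Sketch of the copy-then-atoi loop. `out` stands in for the memory-mapped
// output buffer (an assumption made so the example is self-contained).
inline void atoi_lines(const char *p, const char *end, std::vector<int> &out)
{
    char buf[20];                       // small stack buffer, as in the outline
    while (p != end) {
        std::size_t len = 0;
        // Copy one line, always leaving room for the terminating NUL.
        // Characters beyond the buffer capacity are dropped.
        while (p != end && *p != '\n') {
            if (len < sizeof buf - 1)
                buf[len++] = *p;
            ++p;
        }
        if (p != end)
            ++p;                        // step over the line feed
        // Drop the carriage return of a "\r\n" line ending.
        if (len > 0 && buf[len - 1] == '\r')
            --len;
        buf[len] = '\0';
        if (len > 0)                    // ignore empty lines
            out.push_back(std::atoi(buf));
    }
}
```

The copy exists only because atoi needs a NUL-terminated string, which the mapped file does not provide between lines.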

score 0 · Accepted Answer

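std::vector<int> array;
char * p = ...;   // start of memory mapped block
char * end = ...; // one past the last byte of the memory mapped block
while (p < end)
{
    // strtol parses one number and advances p to the first unparsed char
    array.push_back(static_cast<int>(strtol(p, &p, 10)));
    // skip the line ending (and anything else that is not a digit)
    while (p < end && !isdigit(static_cast<unsigned char>(*p)))
        ++p;
}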

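This code is a little unsafe, since there is no guarantee that strtol will stop at the end of the memory-mapped block, but it's a start. It should run very fast even with the additional checks added.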
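One way to get the missing bound is `std::from_chars` from C++17, which takes an explicit end pointer and never reads past it. This swaps in a different function from the strtol used in the answer, so treat it as a sketch under that assumption; `parse_bounded` is a hypothetical name.

```cpp
#include <charconv>
#include <system_error>
#include <vector>

// Parses every integer in [p, end) without ever reading at or past `end`.
// Returns false on the first malformed or out-of-range number.
inline bool parse_bounded(const char *p, const char *end, std::vector<int> &out)
{
    while (true) {
        // Skip separators: anything that cannot start a number.
        while (p != end && *p != '-' && (*p < '0' || *p > '9'))
            ++p;
        if (p == end)
            return true;
        int value = 0;
        std::from_chars_result r = std::from_chars(p, end, value);
        if (r.ec != std::errc())
            return false;               // a lone '-', or a value outside int
        out.push_back(value);
        p = r.ptr;                      // first byte after the number
    }
}
```

Unlike the strtol loop, this keeps a leading minus sign attached to its number and reports overflow instead of clamping the value.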
