c++ - I think STL is causing my application triple its memory usage

Question

I am reading a 200 MB file into my application, and for some strange reason its memory usage is more than 600 MB. I have tried vector and deque, as well as std::string and char *, to no avail. I need the memory usage of my application to be almost the same as the size of the file I am reading. Any suggestions would be extremely helpful. Is there a bug that causes so much memory consumption? Could you pinpoint the problem, or should I rewrite the whole thing?

Windows Vista SP1 x64, Microsoft Visual Studio 2008 SP1, 32Bit Release Version, Intel CPU

The whole application until now:

#include <string>
#include <vector>
#include <iostream>
#include <iomanip>
#include <fstream>
#include <sstream>
#include <iterator>
#include <algorithm>
#include <time.h>
#include <string.h>   // memcpy, used in str2Vec



static unsigned int getFileSize (const char *filename)
{
    std::ifstream fs;
    fs.open (filename, std::ios::binary);
    fs.seekg(0, std::ios::beg);
    const std::ios::pos_type start_pos = fs.tellg();
    fs.seekg(0, std::ios::end);
    const std::ios::pos_type end_pos = fs.tellg();
    const unsigned int ret_filesize (static_cast<unsigned int>(end_pos - start_pos));
    fs.close();
    return ret_filesize;
}
void str2Vec (std::string &str, std::vector<std::string> &vec)
{
    int newlineLastIndex(0);
    for (int loopVar01 = str.size(); loopVar01 > 0; loopVar01--)
    {
        if (str[loopVar01]=='\n')
        {
            newlineLastIndex = loopVar01;
            break;
        }
    }
    int remainder(str.size()-newlineLastIndex);

    std::vector<int> indexVec;
    indexVec.push_back(0);
    for (unsigned int lpVar02 = 0; lpVar02 < (str.size()-remainder); lpVar02++)
    {
        if (str[lpVar02] == '\n')
        {
            indexVec.push_back(lpVar02);
        }
    }
    int memSize(0);
    for (int lpVar03 = 0; lpVar03 < (indexVec.size()-1); lpVar03++)
    {
        memSize = indexVec[(lpVar03+1)] - indexVec[lpVar03];
        std::string tempStr (memSize,'0');
        memcpy(&tempStr[0],&str[indexVec[lpVar03]],memSize);
        vec.push_back(tempStr);
    }
}
void readFile(const std::string &fileName, std::vector<std::string> &vec)
{
    static unsigned int fileSize = getFileSize(fileName.c_str());
    static std::ifstream fileStream;
    fileStream.open (fileName.c_str(),std::ios::binary);
    fileStream.clear();
    fileStream.seekg (0, std::ios::beg);
    const int chunks(1000); 
    int singleChunk(fileSize/chunks);
    int remainder = fileSize - (singleChunk * chunks);
    std::string fileStr (singleChunk, '0');
    int fileIndex(0);
    for (int lpVar01 = 0; lpVar01 < chunks; lpVar01++)
    {
        fileStream.read(&fileStr[0], singleChunk);
        str2Vec(fileStr, vec);
    }
    std::string remainderStr(remainder, '0');
    fileStream.read(&remainderStr[0], remainder);
    str2Vec(remainderStr, vec);   // process the tail that did not fill a whole chunk
}
int main (int argc, char *argv[])
{   
        std::vector<std::string> vec;
        std::string inFile(argv[1]);
        readFile(inFile, vec);
}

score 5 · Accepted Answer

Your memory is being fragmented.

Try something like this:

  HANDLE heaps[1025];
  DWORD nheaps = GetProcessHeaps((sizeof(heaps) / sizeof(HANDLE)) - 1, heaps);

  for (DWORD i = 0; i < nheaps; ++i) 
  {
    ULONG  HeapFragValue = 2;   // 2 = enable the low-fragmentation heap (LFH)
    HeapSetInformation(heaps[i],
                       HeapCompatibilityInformation,
                       &HeapFragValue,
                       sizeof(HeapFragValue));
  }

score 3 · Accepted Answer

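If I'm reading this right, the biggest issue is that this algorithm automatically doubles the required memory.

In readFile(), you read the whole file into a set of 'singleChunk'-sized strings (chunks), and then in the last loop of str2Vec() you allocate a temporary string for every newline-separated segment of the chunk. So you are doubling the memory right there.

You also have a speed issue: str2Vec makes two passes over the chunk to find all the newlines. There is no reason you can't do that in one.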
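The two passes in str2Vec() can be folded into one. The following is a minimal sketch, not the original poster's code: splitLines and its carry parameter are hypothetical names, and it assumes '\n' line endings. It scans a chunk once, emits each complete line, and keeps any unterminated tail in carry so that the next chunk can finish it.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical single-pass replacement for str2Vec.
// Complete lines found in `chunk` are appended to `lines` (without the '\n').
// Text after the last newline is left in `carry` and is prepended to the
// first line of the next chunk.
void splitLines(const std::string& chunk,
                std::string& carry,
                std::vector<std::string>& lines)
{
    std::size_t start = 0;
    for (std::size_t i = 0; i < chunk.size(); ++i)
    {
        if (chunk[i] != '\n')
            continue;
        carry.append(chunk, start, i - start);   // finish the current line
        lines.push_back(carry);
        carry.clear();
        start = i + 1;
    }
    // Keep the unterminated tail for the next call.
    carry.append(chunk, start, chunk.size() - start);
}
```

After the final chunk, a non-empty carry holds the last line of a file that does not end with a newline.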

score 2 · Accepted Answer

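Another thing you could do is load the entire file into one block of memory. Then make a vector of pointers to the first character of each line, and at the same time replace each newline with a \0 so that every line is null-terminated. (This presumes, of course, that your strings are not supposed to contain \0.)

It is not necessarily as convenient as having a vector of strings, but having a vector of const char* is potentially "just as good."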
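A rough sketch of that idea, assuming the file contents have already been read into a std::vector<char>. The function name splitInPlace is made up for illustration, and the code assumes the data contains no embedded \0 bytes.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: overwrites every '\n' in `buffer` with '\0' and returns
// a pointer to the first character of each line. The pointers stay valid only
// while `buffer` is neither resized nor destroyed.
std::vector<const char*> splitInPlace(std::vector<char>& buffer)
{
    std::vector<const char*> lines;
    if (buffer.empty())
        return lines;

    // Terminate the final line. This may reallocate, so it happens before
    // any pointer into the buffer is taken.
    if (buffer.back() != '\n')
        buffer.push_back('\n');

    char* p = &buffer[0];
    char* const end = p + buffer.size();
    while (p < end)
    {
        // The buffer ends with '\n', so memchr always finds one here.
        char* nl = static_cast<char*>(
            std::memchr(p, '\n', static_cast<std::size_t>(end - p)));
        *nl = '\0';
        lines.push_back(p);
        p = nl + 1;
    }
    return lines;
}
```

The cost per line is one pointer (4 bytes in a 32-bit build) on top of the file data itself.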

score 2 · Accepted Answer

STL containers exist to abstract away memory operations. If you have tight memory limits, you can't really abstract them away.

I would recommend using mmap() (or, on Windows, MapViewOfFile()) to read the file.
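For illustration, here is a minimal POSIX sketch of the mmap() route. It will not compile on the asker's Windows setup, where CreateFileMapping() and MapViewOfFile() play the same role. The function name countLinesMapped is hypothetical, and counting newlines is just a stand-in for whatever processing is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: counts '\n' bytes in a file through a read-only
// mapping, without copying the file into a buffer owned by the program.
// Returns -1 on any failure.
long countLinesMapped(const char* path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    if (st.st_size == 0)     { close(fd); return 0; }   // mmap rejects length 0

    const std::size_t size = static_cast<std::size_t>(st.st_size);
    void* mem = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                        // the mapping remains valid after close
    if (mem == MAP_FAILED)
        return -1;

    const char* data = static_cast<const char*>(mem);
    long lines = 0;
    for (std::size_t i = 0; i < size; ++i)
        if (data[i] == '\n')
            ++lines;

    munmap(mem, size);
    return lines;
}
```

The pages of a mapping like this are backed by the file itself, so the operating system can drop and reload them as needed instead of keeping a private copy.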

score 1 · Accepted Answer

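Don't use std::list. It will require more memory than a vector.
vector does what is called "doubling": when it runs out of space, it allocates twice the memory it currently has. To avoid this you can use the std::vector::reserve() method, and if I'm not mistaken you can check the result with std::vector::capacity() (note that capacity() >= size()).

Since the number of lines is not known during execution, I see no simple algorithm to avoid the "doubling" issue. Following the comment by slavy13.myopenid.com, the solution is to move the information to another, pre-reserved vector after the reading has finished (a relevant question is "How to downsize std::vector?").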
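One common way to do that final move is the copy-and-swap idiom, sketched below. trimCapacity is a hypothetical name. Note that the copy briefly holds two sets of strings, so peak usage rises while it runs. C++11 later added std::vector::shrink_to_fit() as a non-binding request with the same goal, but that is not available in Visual Studio 2008.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: replaces `v` with a copy of itself. Implementations
// typically allocate only what the copy needs, so the excess capacity left
// over from doubling is released when the temporary is destroyed.
void trimCapacity(std::vector<std::string>& v)
{
    std::vector<std::string>(v).swap(v);
}
```

Comparing v.capacity() before and after the call shows how much slack the doubling had left behind.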

score 1 · Accepted Answer

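First, how are you determining memory usage? Task Manager is not a suitable tool for that, because what it shows is not actually memory usage.

Second, apart from your (for some reason?) static variables, the only data that is not freed when you finish reading the file is the vector. So test its capacity, and test the capacity of each string it contains. Find out how much memory each of them uses. You have the tools to determine where the memory is going.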
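A small sketch of that measurement follows. The Footprint struct and measure() are hypothetical names. The numbers are only an approximation: they ignore per-allocation heap overhead, and on implementations with a small-string optimization the capacity of a short string lives inside the string object itself, so stringBuffers can overcount.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical summary of where a vector of strings spends its memory.
struct Footprint
{
    std::size_t vectorSlots;    // bytes reserved for the std::string objects
    std::size_t stringBuffers;  // sum of capacity() over all strings
    std::size_t payload;        // sum of size() over all strings (actual text)
};

Footprint measure(const std::vector<std::string>& vec)
{
    Footprint f = { 0, 0, 0 };
    f.vectorSlots = vec.capacity() * sizeof(std::string);
    for (std::size_t i = 0; i < vec.size(); ++i)
    {
        f.stringBuffers += vec[i].capacity();
        f.payload       += vec[i].size();
    }
    return f;
}
```

If payload is close to the file size while vectorSlots plus stringBuffers is far larger, the containers themselves are the overhead rather than the file data.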

score 1 · Accepted Answer

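Inside readFile you have at least two copies of your file: the ifstream, and the data copied into your std::vector. As long as you have the file open and are copying it the way you are, it will be hard to get your total memory footprint below double the file size.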

score 1 · Accepted Answer

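I think your attempt to write your own buffering strategy is misguided.

The streams already have a very good buffering strategy implemented. If you think you need a bigger buffer, you can install one into the stream without any extra code to manage it.

This is what I came up with. NB: I tested it with a text version of the "King James Bible" that I found online.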

#include <string>
#include <vector>
#include <list>
#include <fstream>
#include <algorithm>
#include <iterator>
#include <iostream>

class Line: public std::string
{
};

std::istream& operator>>(std::istream& in,Line& line)
{
    // Relatively efficient way to copy a line into a string.
    return std::getline(in,line);
}
std::ostream& operator<<(std::ostream& out,Line const& line)
{
    return out << static_cast<std::string const&>(line) << "\n";
}

void readLinesFromStream(std::istream& stream,std::vector<Line>& lines)
{
    /*
     * Read into a list as this is flexible in memory usage and will not
     * allocate huge chunks of un-required space.
     *
     * Even with huge files the space for list will be insignificant
     * compared to the size of the data.
     *
     * This then allows us to reserve the correct size of the vector
     * Thus avoiding huge memory chunks being prematurely allocated that
     * are not required. It also prevents the internal structure from
     * being copied every time the container is re-sized.
     */
    std::list<Line>     data;
    std::copy(  std::istream_iterator<Line>(stream),
                std::istream_iterator<Line>(),
                std::inserter(data,data.end())
             );

    /*
     * Reserve the correct size in the vector.
     * then copy out of the list into the vector
     */
    lines.reserve(data.size());
    std::copy(  data.begin(),
                data.end(),
                std::back_inserter(lines)
             );
}

void readLinesFromFile(std::string const& name,std::vector<Line>& lines)
{
    /*
     * Set up the file stream and override the default buffer used by the stream.
     * Make it big because we think the istream buffer is insufficient!!!!
     */
    std::vector<char>   buffer(10000);   // declared first so it outlives the stream that uses it
    std::ifstream       file;
    file.rdbuf()->pubsetbuf(&buffer[0],buffer.size());

    file.open(name.c_str());
    readLinesFromStream(file,lines);
}


int main(int argc,char* argv[])
{
    std::vector<Line>   lines;
    readLinesFromFile(argv[1],lines);

    // Un-comment if your file is larger than 1100 lines.

    // I tested with a copy of the King James bible. 
    // std::cout << "Lines: " << lines.size() << "\n";
    // std::copy(lines.begin() + 1000,lines.begin() + 1100,std::ostream_iterator<Line>(std::cout));
}

score 0 · Accepted Answer

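Perhaps you should elaborate on why you need to read the whole file into memory. I suspect there is a way to do what you need without reading the entire file into memory at once. If you really do need this, look into memory-mapped files, which are likely to be more efficient than building an equivalent yourself. Your internal data structures can then use offsets into the file. By the way, be sure to check whether you need to handle character encodings.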
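As an illustration of keeping offsets instead of the text itself, here is a minimal sketch. indexLines and fetchLine are hypothetical names, and the code assumes a seekable stream. A file stream should be opened in binary mode so that offsets match byte positions.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>   // std::istringstream, handy for trying this out
#include <string>
#include <vector>

// Hypothetical helper: records the byte offset at which each line starts.
// Only a small fixed buffer is held in memory at any time.
std::vector<std::streamoff> indexLines(std::istream& in)
{
    std::vector<std::streamoff> offsets;
    std::streamoff pos = 0;
    bool atLineStart = true;
    char buf[4096];
    for (;;)
    {
        in.read(buf, sizeof(buf));
        const std::streamsize got = in.gcount();
        if (got <= 0)
            break;
        for (std::streamsize i = 0; i < got; ++i)
        {
            if (atLineStart) { offsets.push_back(pos + i); atLineStart = false; }
            if (buf[i] == '\n') atLineStart = true;
        }
        pos += got;
    }
    return offsets;
}

// Hypothetical helper: reads line `n` on demand by seeking to its offset.
std::string fetchLine(std::istream& in,
                      const std::vector<std::streamoff>& offsets,
                      std::size_t n)
{
    assert(n < offsets.size());
    in.clear();                              // drop eofbit/failbit from earlier reads
    in.seekg(offsets[n], std::ios::beg);
    std::string line;
    std::getline(in, line);
    return line;
}
```

With this layout the resident data is one offset per line, and the text of a line is read only when it is asked for.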

score 0 · Accepted Answer

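You should know that because you declared fileStream as static, it never goes out of scope, so the file is not closed until the very last moment of execution. This certainly ties up some memory. You could explicitly close it just before that last str2Vec to try to help the situation.

Also, you open and close the same file multiple times. Just open it once and pass it by reference (resetting its state if necessary). I imagine, though, that you could do what you need with a single pass through the file.

By the way, I doubt you really need to know the file size the way you do here. You could simply read "chunk"-sized amounts until you get a short read (at which point you are done).

Why don't you explain what your code is trying to do? I think there is a simpler solution.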
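The read-until-short-read loop might look roughly like this. ChunkStats, scanInChunks, and scanString are hypothetical names, and counting newlines is only a placeholder for real per-chunk processing. No file size is queried anywhere.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical result type for the sketch below.
struct ChunkStats
{
    std::size_t bytes;      // total bytes read
    std::size_t newlines;   // '\n' bytes seen
    std::size_t reads;      // number of reads that returned data
};

// Reads `in` in fixed-size chunks and stops at the first short read.
ChunkStats scanInChunks(std::istream& in, std::size_t chunkSize)
{
    assert(chunkSize > 0);
    std::vector<char> buf(chunkSize);
    ChunkStats s = { 0, 0, 0 };
    for (;;)
    {
        in.read(&buf[0], static_cast<std::streamsize>(buf.size()));
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        s.bytes += got;
        s.newlines += static_cast<std::size_t>(
            std::count(buf.begin(),
                       buf.begin() + static_cast<std::ptrdiff_t>(got),
                       '\n'));
        if (got > 0)
            ++s.reads;
        if (got < buf.size())
            break;            // short read: end of input (or an error)
    }
    return s;
}

// Convenience wrapper for trying the loop on in-memory text.
ChunkStats scanString(const std::string& text, std::size_t chunkSize)
{
    std::istringstream in(text);
    return scanInChunks(in, chunkSize);
}
```

The same function works on a std::ifstream opened once in binary mode and passed by reference, which also removes the need for the static stream.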

score 0 · Accepted Answer

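Try using a list instead of a vector. Vectors are (almost always) linear in memory.

Granted, the fact that the elements are strings, which are (almost always) copy-on-write and reference-counted, should make that less of a problem, but it might help.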

score 0 · Accepted Answer

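I don't know whether this is relevant, since I don't really know what your file looks like.

But you should be aware that std::string is likely to have considerable space overhead when it stores very short strings. And if you are individually new-ing up a char* for each very short string, you will also see all the allocation block overhead.

How many strings are you putting into that vector, and what is their average length?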

score 0 · Accepted Answer

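The best way to create the lines is to memory-map the file read-only. Don't bother writing \0 in place of \n. Instead, use pairs of const char*s, such as std::pair<const char*, const char*>, or pairs of a const char* and a count. If you need to edit the lines, a good approach is an object that can store either a pointer pair or a std::string holding the modified line.

As for saving memory with STL vectors or deques, a good technique is to let the container keep doubling until you are done adding to it. Then resize it to its real size, which should release the unused memory back to the heap allocator. The memory may still be allocated to the program, but I wouldn't worry about that. Also, instead of starting from the default size, start by getting the file size in bytes, divide it by your best estimate of the average number of characters per line, and reserve that much space up front.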
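Both suggestions can be sketched together. LineRange and findLines are hypothetical names, and avgLineLenGuess is whatever estimate of the average line length the caller has. The code works on any read-only byte range, such as a mapped file, and never modifies it.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// A line is the half-open range [first, second); the '\n' is not included.
typedef std::pair<const char*, const char*> LineRange;

// Hypothetical helper: finds the lines in [begin, end) without copying them.
std::vector<LineRange> findLines(const char* begin,
                                 const char* end,
                                 std::size_t avgLineLenGuess)
{
    assert(begin <= end && avgLineLenGuess > 0);

    std::vector<LineRange> lines;
    // Reserve from the estimate so that the vector rarely has to double.
    lines.reserve(static_cast<std::size_t>(end - begin) / avgLineLenGuess + 1);

    const char* lineStart = begin;
    for (const char* p = begin; p != end; ++p)
    {
        if (*p == '\n')
        {
            lines.push_back(LineRange(lineStart, p));
            lineStart = p + 1;
        }
    }
    if (lineStart != end)                     // last line without a newline
        lines.push_back(LineRange(lineStart, end));
    return lines;
}
```

Each line costs two pointers (8 bytes in a 32-bit build), and a std::string can be constructed from a pair only for the lines that actually need editing.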

score -1 · Accepted Answer

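Growing a vector with push_back() causes memory fragmentation and inefficient memory use. I would try using a list instead, and create a vector (if you need one) only once you know exactly how many elements it needs to hold.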
