c - ファイルから単語ごとに読み取るか、一度に1行ずつ読み取り、Cを使用して文字列を分割する、より効率的な方法は何ですか？

Question

ディスク上のファイルから単語ごとにチェックする必要があるCでアプリケーションを開発したいと思います。ファイルへのアクセスが少なくて済むので、ファイルから1行を読み取り、それを単語に分割する方が効率的であると言われています。それは本当ですか？

score 8 · Accepted Answer

ファイル全体が必要になることがわかっている場合は、できるだけ大きなチャンクでファイルを読み取ることもできます（極端な場合、ファイル全体を一度にメモリにマップします）。これは、必要なファイルアクセスが少ないためです。

ただし、プログラムが遅くない場合は、開発が最も速く、バグが少ない方法でプログラムを作成してください。初期の最適化は重大な罪です。

score 5 · Accepted Answer

Not really true, assuming you're going to be using scanf() and your definition of 'word' matches what scanf() treats as a word.

The standard I/O library will buffer the actual disk reads, and reading a line or a word will have essentially the same I/O cost in terms of disk accesses. If you were to read big chunks of a file using fread(), you might get some benefit — at a cost in complexity.

But for reading words, it's likely that scanf() and a protective string format specifier such as %99s if your array is char word[100]; would work fine and is probably simpler to code.

If your definition of word is more complex than the definition supported by scanf(), then reading lines and splitting is probably easier.

score 2 · Accepted Answer

As far as splitting is concerned there is no difference with respect to performance. You are splitting using whitespace in one case and newline in another.

However it would impact in case of word in a way that you would need to allocate buffers M times, while in case of lines it will be N times, where M>N. So if you are adopting word split approach, try to calculate total memory need first, allocate that much chunk (so you don't end up with fragmented M chunks), and later get M buffers from that chunk. Note that same approach can be applied in lines split but the difference will be less visible.

score 1 · Accepted Answer

This is correct, you should read them in to a buffer, and then split into whatever you define as 'words'. The only case where this would not be true is if you can get fscanf() to correctly grab out what you consider to be words (doubtful).

score 0 · Accepted Answer

主なパフォーマンスのボトルネックは次のとおりです。

stdio ファイル I/O 関数の呼び出し。呼び出しが少ないほど、オーバーヘッドは少なくなります。
動的メモリ割り当て。できるだけ少なくする必要があります。結局、malloc の呼び出しが多いと、ヒープのセグメンテーションが発生します。

要するに、古典的なプログラミングの考慮事項です。実行時間を短縮するか、メモリ使用量を少なくすることができます。両方を取得することはできませんが、実行時間とメモリ消費の両方の点で最も効果的な適切な中間点を見つけることができます。

極端な例としては、ファイル全体を 1 つの大きなチャンクとして読み取り、動的メモリにアップロードすることで、可能な限り高速に実行できます。または、逆に、バイトごとに読み取り、読み取り時に評価することができます。これにより、プログラムが遅くなる可能性がありますが、動的メモリはまったく使用されません。

コードを最も効果的に最適化するには、さまざまな CPU 固有および OS 固有の機能に関する基本的な知識が必要です。アラインメント、キャッシュメモリレイアウト、基になる API 関数呼び出しの有効性などの問題はすべて重要です。

いくつかの異なる方法を試して、それらをベンチマークしてみませんか?

score 0 · Accepted Answer

実際に正確な質問（単語と行）に答えるわけではありませんが、メモリ内のすべての単語を同時に必要とする場合、最も効率的なアプローチは次のとおりです。

ファイルサイズを決定する
ファイル全体に 1 バイトを加えたバッファを割り当てます
ファイル全体をバッファに読み込み'\0'、余分なバイトに入れます。
それをパスして、単語数を数えます
char*(単語へのポインター) またはint(バッファーへのインデックス) インデックス配列を、サイズが単語数と一致するように割り当てます
バッファを介して 2 回目のパスを作成し、アドレスまたは単語の最初の文字へのインデックスをインデックス配列に格納し、バッファ内の他のバイトを'\0'(文字列マーカーの終わり) で上書きします。

十分なメモリがある場合は、単語数の最悪のケースを想定する方がおそらくわずかに高速です: (filesize+1) / 2(1 文字の単語の間に 1 つのスペースがあり、ファイル内のバイト数が奇数である)。また、インデックス配列を使用して Java ArrayList または Qt QVector アプローチを採用し、realloc()単語数が現在の容量を超えたときにそのサイズを 2 倍にすることは、非常に効率的です (倍増 = 指数関数的な増加のため、再割り当ては何度も発生しません)。

c - ファイルから単語ごとに読み取るか、一度に1行ずつ読み取り、Cを使用して文字列を分割する、より効率的な方法は何ですか？

6 に答える 6

Related

Reference