python - ドライブにデータを保存する最も効率的な方法

Question

ベースライン - 10,000 エントリの CSV データがあります。これを 1 つの csv ファイルとして保存し、一度にすべて読み込みます。

代替 - 10,000 エントリの CSV データがあります。これを 10,000 個の CSV ファイルとして保存し、個別に読み込みます。

これは、計算上、どのくらい非効率的ですか。私はメモリの問題にはあまり興味がありません。代替方法の目的は、データのサブセットに頻繁にアクセスする必要があり、配列全体を読み取る必要がないためです。

私はパイソンを使用しています。

編集：必要に応じて他のファイル形式を使用できます。

Edit1: SQLite が勝ちます。以前に比べて驚くほど簡単で効率的です。

score 6 · Accepted Answer

SQLiteは、アプリケーションにとって理想的なソリューションです。

CSV ファイルを SQLite データベーステーブルにインポートし (単一ファイルになります)、必要に応じてインデックスを追加します。

データにアクセスするには、python sqlite3 ライブラリを使用します。使用方法については、このチュートリアルを使用できます。

他の多くのソリューションと比較して、SQLite は部分的なデータセットをローカルで選択する最速の方法です。10000 個のファイルにアクセスするよりもはるかに高速です。SQLite が優れている理由を説明するこの回答もお読みください。

score 1 · Accepted Answer

すべての行を 1 つのファイルに書き込みます。10,000 行の場合、おそらく価値はありませんが、すべての行を同じ長さ (たとえば 1000 バイト) にパディングできます。

次に、n 番目の行に移動するのは簡単seekです。n に行の長さを掛けるだけです。

score 0 · Accepted Answer

10,000 files is going to be slower to load and access than one file, if only because the files' data will likely be fragmented around your disk drive, so accessing it will require a much larger number of seeks than would accessing the contents of a single file, which will generally be stored as sequentially as possible. Seek times are a big slowdown on spinning media, since your program has to wait while the drive heads are physically repositioned, which can take milliseconds. (slow seeks times aren't an issue for SSDs, but even then there will still be the overhead of 10,000 file's worth of metadata for the operating system to deal with). Also with a single file, the OS can speed things up for you by doing read-ahead buffering (as it can reasonably assume that if you read one part of the file, you will likely want to read the next part soon). With multiple files, the OS can't do that.

My suggestion (if you don't want to go the SQLite route) would be to use a single CSV file, and (if possible) pad all of the lines of your CSV file out with spaces so that they all have the same length. For example, say you make sure when writing out the CSV file to make all lines in the file exactly 80 bytes long. Then reading the (n)th line of the file becomes relatively fast and easy:

myFileObject.seek(n*80)
theLine = myFileObject.read(80)

python - ドライブにデータを保存する最も効率的な方法

3 に答える 3

Related

Reference