r - R: Is it possible to parallelize / speed-up the reading in of a 20 million plus row CSV into R?

Question

Once the CSV is loaded via read.csv, it's fairly trivial to use multicore, segue etc to play around with the data in the CSV. Reading it in, however, is quite the time sink.

I realise it would be better to use MySQL or a similar database.

Assume the use of an AWS 8xl cluster compute instance running R 2.13.

Specs as follows:

Cluster Compute Eight Extra Large specifications:
88 EC2 Compute Units (Eight-core 2 x Intel Xeon)
60.5 GB of memory
3370 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)

Any thoughts / ideas much appreciated.

score 5 · Accepted Answer

You may not need parallelism at all if you use fread from data.table.

library(data.table)
dt <- fread("myFile.csv")

The comments on this question illustrate its power. Here is also an example from my own experience:

d1 <- fread('Tr1PointData_ByTime_new.csv')
Read 1048575 rows and 5 (of 5) columns from 0.043 GB file in 00:00:09

I was able to read 1.04 million rows in under 10 seconds.
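To check whether fread is fast enough on your own machine before reaching for parallelism, a rough timing comparison against read.csv can help. The following is only a sketch: the file is a small synthetic stand-in written to a temporary path, the column names are made up for illustration, and the timings will depend heavily on disk, data types, and data.table version.

```r
library(data.table)

# Synthetic stand-in for the real CSV (hypothetical columns id, x, y).
path <- tempfile(fileext = ".csv")
n <- 200000L
synthetic <- data.frame(id = seq_len(n), x = rnorm(n), y = runif(n))
write.csv(synthetic, path, row.names = FALSE)

# Time the base reader and fread on the same file.
t_base  <- system.time(df <- read.csv(path))
t_fread <- system.time(dt <- fread(path))

# Elapsed wall-clock seconds for each reader.
print(c(read.csv = t_base[["elapsed"]], fread = t_fread[["elapsed"]]))
```

If the fread timing scales acceptably to the full 20 million rows, the extra complexity of splitting the read across workers may not be worth it.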

score 4 · Accepted Answer

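What you could do is use scan. Two of its input arguments might prove interesting: n and skip. You simply open two or more connections to the file and use skip and n to select the part of the file you want to read. There are some caveats: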

At some stage, disk I/O may prove to be the bottleneck.
Hopefully scan does not complain when you open multiple connections to the same file.

But you could give it a try and see whether it improves your speed.
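The skip and n idea can be sketched roughly as follows. This is not a tested recipe for the 20-million-row file: it assumes a headerless, all-numeric CSV whose row and column counts are known in advance, it uses a small synthetic file, and it relies on the parallel package, which ships with R 2.14 and later (so not the R 2.13 mentioned in the question). The helper read_chunk and the worker count are made up for illustration.

```r
library(parallel)

# Synthetic stand-in for the real file: 1000 rows, 3 numeric columns, no header.
path <- tempfile(fileext = ".csv")
n_rows <- 1000L
n_cols <- 3L
write.table(matrix(seq_len(n_rows * n_cols), ncol = n_cols, byrow = TRUE),
            path, sep = ",", row.names = FALSE, col.names = FALSE)

# Hypothetical helper: skip `start` lines, then read `rows` rows' worth of values.
# scan's `n` counts individual values, not lines, hence rows * n_cols.
read_chunk <- function(start, rows, path, n_cols) {
  values <- scan(path, what = numeric(), sep = ",",
                 skip = start, n = rows * n_cols, quiet = TRUE)
  matrix(values, ncol = n_cols, byrow = TRUE)
}

# Split the row range into one contiguous chunk per worker.
n_workers <- 4L
chunk_rows <- ceiling(n_rows / n_workers)
starts <- seq(0L, n_rows - 1L, by = chunk_rows)
sizes <- pmin(chunk_rows, n_rows - starts)

# Each worker process opens the file itself and reads only its own chunk.
cl <- makeCluster(n_workers)
chunks <- clusterMap(cl, read_chunk, starts, sizes,
                     MoreArgs = list(path = path, n_cols = n_cols))
stopCluster(cl)

# Stitch the chunks back together in file order.
result <- do.call(rbind, chunks)
```

Note that skip still has to read past the skipped lines to find where they end, so later chunks do more work than earlier ones, and all workers compete for the same disk. Whether this beats a single fast reader is something to measure rather than assume.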
