r - Reading in chunks at a time using fread in package data.table

Question

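I'm trying to read a large tab-delimited file (around 2 GB) using the fread function in package data.table. However, because it is so large, it doesn't fit completely in memory. I tried to read it in chunks using the skip and nrow arguments, like this: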

chunk.size = 1e6
done = FALSE
chunk = 1
while(!done)
{
    temp = fread("myfile.txt",skip=(chunk-1)*chunk.size,nrow=chunk.size-1)
    #do something to temp
    chunk = chunk + 1
    if(nrow(temp)<2) done = TRUE
}

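In the example above, I'm reading 1 million rows at a time, performing a calculation on them, and then getting the next million, and so on. The problem with this code is that after every chunk is retrieved, fread has to start scanning the file from the very beginning, because skip increases by a million on every loop iteration. As a result, fread takes longer and longer to actually reach the next chunk, which makes this very inefficient.

Is there a way to tell fread to pause every, say, 1 million lines and then continue reading from that point without having to restart at the beginning? Are there any solutions, or should this be a new feature request?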

score 7 · Accepted Answer

A related option is the chunked package. Here is an example with a 3.5 GB text file:

library(chunked)
library(tidyverse)

# I want to look at the daily page views of Wikipedia articles
# before 2015... I can get zipped log files
# from here: https://dumps.wikimedia.org/other/pagecounts-ez/merged/2012/2012-12/
# I get bz file, unzip to get this: 

my_file <- 'pagecounts-2012-12-14/pagecounts-2012-12-14'

# How big is my file?
print(paste(round(file.info(my_file)$size  / 2^30,3), 'gigabytes'))
# [1] "3.493 gigabytes" too big to open in Notepad++ !
# But can read with 010 Editor

# look at the top of the file 
readLines(my_file, n = 100)

# to find where the content starts, vary the skip value, 
read.table(my_file, nrows = 10, skip = 25)

Here is where we start working on chunks of the file. We can use most dplyr verbs in the usual way:

# Let the chunked pkg work its magic! We only want the lines containing 
# "Gun_control". The main challenge here was identifying the column
# header
df <- 
read_chunkwise(my_file, 
               chunk_size=5000,
               skip = 30,
               format = "table",
               header = TRUE) %>% 
  filter(stringr::str_detect(De.mw.De.5.J3M1O1, "Gun_control"))

# this line does the evaluation, 
# and takes a few moments...
system.time(out <- collect(df))

Here we can work on the output as usual, since it is much smaller than the input file:

# clean up the output to separate into cols, 
# and get the number of page views as a numeric
out_df <- 
out %>% 
  separate(De.mw.De.5.J3M1O1, 
           into = str_glue("V{1:4}"),
           sep = " ") %>% 
  mutate(V3 = as.numeric(V3))

 head(out_df)
    V1                                                        V2   V3
1 en.z                                               Gun_control 7961
2 en.z Category:Gun_control_advocacy_groups_in_the_United_States 1396
3 en.z          Gun_control_policy_of_the_Clinton_Administration  223
4 en.z                            Category:Gun_control_advocates   80
5 en.z                         Gun_control_in_the_United_Kingdom   68
6 en.z                                    Gun_control_in_america   59
                                                                                 V4
1 A34B55C32D38E32F32G32H20I22J9K12L10M9N15O34P38Q37R83S197T1207U1643V1523W1528X1319
2                                     B1C5D2E1F3H3J1O1P3Q9R9S23T197U327V245W271X295
3                                     A3B2C4D2E3F3G1J3K1L1O3P2Q2R4S2T24U39V41W43X40
4                                                            D2H1M1S4T8U22V10W18X14
5                                                             B1C1S1T11U12V13W16X13
6                                                         B1H1M1N2P1S1T6U5V17W12X12

#--------------------
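
For comparison, here is a minimal sketch of a different approach that stays closer to the question: keep a base R connection open so that each read resumes where the previous one stopped, and hand each batch of lines to fread. This is not part of the chunked package. The helper name process_in_chunks and its arguments are hypothetical, and the sketch assumes a single header line, no quoted fields spanning several lines, and a data.table version that supports fread(text = ) (1.11.0 or later).

```r
library(data.table)

# Hypothetical helper: apply `fun` to successive chunks of a tab-delimited file.
# The connection stays open between reads, so the file is scanned only once.
process_in_chunks <- function(path, chunk_size, fun) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # Assumption: the first line is the header.
  header <- readLines(con, n = 1)
  results <- list()

  repeat {
    # readLines() continues from the current position of the open connection.
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break

    # Prepend the header so every chunk gets the same column names.
    chunk <- fread(text = c(header, lines), sep = "\t", header = TRUE)
    results[[length(results) + 1]] <- fun(chunk)
  }
  results
}
```

It could be called as process_in_chunks("myfile.txt", 1e6, function(dt) nrow(dt)), replacing the function with whatever computation is needed per chunk. Because fread infers column types separately for each chunk, passing colClasses explicitly may be needed if the types must agree across chunks.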
