performance - Why is this R code so slow?

Question

I am trying to create a data frame based on information in another data frame.

The first data frame (base_mar_bop) contains data such as:

201301|ABC|4
201302|DEF|12

From this I want to create a data frame with 16 rows:

4 times: 201301|ABC|1
12 times: 201302|DEF|1

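I wrote a script that takes a very long time to run. To give you an idea, the final data frame has about 2 million rows and the source data frame has about 10,000 rows. I cannot post the source files for the data frames because the data is confidential.

Because this code took so long to run, I decided to do it in PHP instead. It ran in under a minute and got the job done, writing the result to a txt file, which I then imported into R.

I have no clue why R takes so long. Is it the function calls? Is it the nested for loop? From my point of view there are not that many computationally intensive steps in there.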

# first create an empty dataframe called base_eop that will hold each subscriber on a row,
# identified by CED, RATEPLAN and 1
# where 1 is the count and the sum of 1 should end up with the base
base_eop <-base_mar_bop[1,]

# let's give some logical names to the columns in the df
names(base_eop) <- c('CED','RATEPLAN','BASE')


# define the function that enables us to insert a row at the bottom of the dataframe
insertRow <- function(existingDF, newrow, r) {
  existingDF[seq(r+1,nrow(existingDF)+1),] <- existingDF[seq(r,nrow(existingDF)),]
  existingDF[r,] <- newrow
  existingDF
}


# now loop through the eop base for march, each row contains the ced, rateplan and number of subs
# we need to insert a row for each individual sub
for (i in 1:nrow(base_mar_eop)) {
  # we go through every row in the dataframe
  for (j in 1:base_mar_eop[i,3]) {
    # we insert a row for each CED, rateplan combination and set the base value to 1
    base_eop <- insertRow(base_eop,c(base_mar_eop[i,1:2],1),nrow(base_eop)) 
  }
}

# since the dataframe was created using the first row of base_mar_bop we need to remove this first row
base_eop <- base_eop[-1,]

score 4 · Accepted Answer

Here is one approach with data.table, though @BenBolker's timings are already awesome.

library(data.table)
DT <- data.table(d2)  ## d2 from @BenBolker's answer
out <- DT[, ID:=1:.N][rep(ID, BASE)][, `:=`(BASE=1, ID=NULL)]
out
#            CED RATEPLAN BASE
#       1:     1        A    1
#       2:     1        A    1
#       3:     1        A    1
#       4:     1        A    1
#       5:     1        A    1
#      ---                    
# 1999996: 10000        Y    1
# 1999997: 10000        Y    1
# 1999998: 10000        Y    1
# 1999999: 10000        Y    1
# 2000000: 10000        Y    1

Here, I've used compound queries to do the following:

Create an ID variable that is really just 1 to the number of rows in the data.table.
Use rep to repeat the ID variable by the corresponding BASE value.
Replace all BASE values with 1 and drop the ID variable we created earlier.
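
To make the three steps concrete, here is a minimal sketch that runs them one at a time on the two sample rows from the question. The toy table and the name `expanded` are assumptions made for illustration; the real `d2` comes from @BenBolker's answer and is not reproduced here.

```r
library(data.table)

# Toy input mirroring the two sample rows from the question (illustrative only)
DT <- data.table(CED = c(201301L, 201302L),
                 RATEPLAN = c("ABC", "DEF"),
                 BASE = c(4L, 12L))

# Step 1: add a row-number column; := modifies DT by reference
DT[, ID := 1:.N]

# Step 2: repeat each row number BASE times and use the result as a row index
expanded <- DT[rep(ID, BASE)]

# Step 3: set every BASE to 1 and drop the helper ID column
expanded[, `:=`(BASE = 1, ID = NULL)]

nrow(expanded)  # 16
```

Each bracket in the one-line version corresponds to one of these statements; chaining them just avoids naming the intermediate result.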

Perhaps there is a more efficient way to do this though. For example, dropping one of the compound queries should make it a little faster. Perhaps something like:

out <- DT[rep(1:nrow(DT), BASE)][, BASE:=1]
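
For comparison, the same row-index idea can be sketched in base R without data.table. This is an illustrative analog under assumed toy data, not code from the answer, and the name `idx` is hypothetical; it replaces the question's row-by-row insertion with a single vectorized subset.

```r
# Toy input mirroring the two sample rows from the question (illustrative only)
base_mar_bop <- data.frame(CED = c(201301L, 201302L),
                           RATEPLAN = c("ABC", "DEF"),
                           BASE = c(4L, 12L),
                           stringsAsFactors = FALSE)

# Build the row index once: row i appears BASE[i] times
idx <- rep(seq_len(nrow(base_mar_bop)), times = base_mar_bop$BASE)

# One subset call creates all rows at once instead of growing the frame in a loop
base_eop <- base_mar_bop[idx, ]
base_eop$BASE <- 1L
rownames(base_eop) <- NULL

nrow(base_eop)  # 16
```

The result is allocated in one step, so the data frame is never copied once per inserted row as it is in the nested loop from the question.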
