r - Efficient way to rbind data.frames with different columns

Question

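I have a list of data frames with different sets of columns. I would like to combine them by rows into a single data frame. I use plyr::rbind.fill to do that. I am looking for something that does this more efficiently, along the lines of the answer given here.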

require(plyr)

set.seed(45)
sample.fun <- function() {
   nam <- sample(LETTERS, sample(5:15, 1))  # draw one column count between 5 and 15
   val <- data.frame(matrix(sample(letters, length(nam)*10,replace=TRUE),nrow=10))
   setNames(val, nam)  
}
ll <- replicate(1e4, sample.fun(), simplify = FALSE)  # always return a list of data.frames
rbind.fill(ll)

score 40 · Accepted Answer

UPDATE: See this updated answer instead.

UPDATE (eddi): This has now been implemented in version 1.8.11 as a fill argument to rbind. For example:

DT1 = data.table(a = 1:2, b = 1:2)
DT2 = data.table(a = 3:4, c = 1:2)

rbind(DT1, DT2, fill = TRUE)
#   a  b  c
#1: 1  1 NA
#2: 2  2 NA
#3: 3 NA  1
#4: 4 NA  2

FR #4790 has now been added: rbind.fill-like functionality (from plyr) to merge a list of data.frames/data.tables.
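For a list of data frames like the one in the question, a minimal sketch of the same idea is below. It assumes a data.table version in which rbindlist accepts fill = TRUE (v1.9.3 or later, per the NEWS entry quoted in the second answer). The helper make_df and the list small_ll are illustrative names, not part of the original post.

```r
library(data.table)

set.seed(45)

# Hypothetical helper: builds a 10-row data.frame with a random subset of
# LETTERS as column names, mirroring sample.fun() from the question.
make_df <- function() {
  nam <- sample(LETTERS, sample(5:15, 1))
  val <- data.frame(matrix(sample(letters, length(nam) * 10, replace = TRUE),
                           nrow = 10))
  setNames(val, nam)
}

# A small list so the example runs quickly.
small_ll <- replicate(20, make_df(), simplify = FALSE)

# Bind by column name; columns missing from an element are filled with NA.
res <- rbindlist(small_ll, fill = TRUE)

dim(res)
```

The result has one row per input row and one column per distinct column name found anywhere in the list.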

Note 1:

This solution uses data.table's rbindlist function to "rbind" a list of data.tables. For this, be sure to use version 1.8.9, because of this bug in versions < 1.8.9.

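Note 2:

When binding lists of data.frames/data.tables, rbindlist currently retains the data type of the first occurrence of each column. That is, if a column in the first data.frame is character and the same column in the second data.frame is "factor", rbindlist will make this column character. So, if your data.frames consist entirely of character columns, the result of this solution will be identical to that of the plyr method. If not, the values will still be the same, but some columns will be character instead of factor. You will have to convert them to "factor" yourself afterwards. Hopefully this behaviour will change in the future.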
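The "convert to factor afterwards" step from Note 2 could look roughly like the sketch below. It assumes a data.table version where rbindlist accepts fill = TRUE, and the inputs df1 and df2 are made up for illustration. Which type the bound column ends up with may depend on the data.table version, so the conversion is written to be a no-op when there are no character columns left.

```r
library(data.table)

# Two small inputs whose shared column `g` has different types.
df1 <- data.frame(g = c("a", "b"), x = 1:2, stringsAsFactors = FALSE)
df2 <- data.frame(g = factor(c("b", "c")), y = 3:4)

bound <- rbindlist(list(df1, df2), fill = TRUE)

# Find the character columns and convert them to factor by reference.
chr_cols <- names(bound)[vapply(bound, is.character, logical(1))]
if (length(chr_cols) > 0L) {
  bound[, (chr_cols) := lapply(.SD, as.factor), .SDcols = chr_cols]
}

str(bound)
```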

And now, here is the solution using data.table (with a benchmark comparison against rbind.fill from plyr):

require(data.table)
rbind.fill.DT <- function(ll) {
    # changed sapply to lapply to return a list always
    all.names <- lapply(ll, names)
    unq.names <- unique(unlist(all.names))
    ll.m <- rbindlist(lapply(seq_along(ll), function(x) {
        tt <- ll[[x]]
        setattr(tt, 'class', c('data.table', 'data.frame'))
        data.table:::settruelength(tt, 0L)
        invisible(alloc.col(tt))
        tt[, c(unq.names[!unq.names %chin% all.names[[x]]]) := NA_character_]
        setcolorder(tt, unq.names)
    }))
}

rbind.fill.PLYR <- function(ll) {
    rbind.fill(ll)
}

require(microbenchmark)
microbenchmark(t1 <- rbind.fill.DT(ll), t2 <- rbind.fill.PLYR(ll), times=10)
# Unit: seconds
#                      expr      min        lq    median        uq       max neval
#   t1 <- rbind.fill.DT(ll)  10.8943  11.02312  11.26374  11.34757  11.51488    10
# t2 <- rbind.fill.PLYR(ll) 121.9868 134.52107 136.41375 184.18071 347.74724    10


# for comparison change t2 to data.table
setattr(t2, 'class', c('data.table', 'data.frame'))
data.table:::settruelength(t2, 0L)
invisible(alloc.col(t2))
setcolorder(t2, unique(unlist(sapply(ll, names))))

identical(t1, t2) # [1] TRUE

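Note that plyr's rbind.fill has an edge over this particular data.table solution until the list size reaches about 500.

Benchmarking plot:

Here is a plot of the run times for list lengths (number of data.frames) of seq(1000, 10000, by=1000). I used microbenchmark with 10 reps at each of these list lengths.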
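The idea behind rbind.fill.DT (pad every element with the columns it lacks, put the columns in a common order, then bind) can also be sketched with documented functions only. This is not the code that was benchmarked: it copies each element with as.data.table instead of modifying it in place, so it will be slower, and pad_and_bind is a hypothetical name.

```r
library(data.table)

# Hypothetical helper illustrating the pad-then-bind approach.
pad_and_bind <- function(ll) {
  all_names <- unique(unlist(lapply(ll, names)))
  padded <- lapply(ll, function(df) {
    dt <- as.data.table(df)                    # copy, leaves the input untouched
    absent <- setdiff(all_names, names(dt))    # columns this element lacks
    if (length(absent) > 0L) {
      dt[, (absent) := NA_character_]          # pad with character NAs
    }
    setcolorder(dt, all_names)                 # same column order everywhere
    dt
  })
  rbindlist(padded)
}

toy <- list(
  data.frame(a = 1:2, b = c("x", "y"), stringsAsFactors = FALSE),
  data.frame(b = "z", c = "w", stringsAsFactors = FALSE)
)
out <- pad_and_bind(toy)
out
```

Padding with NA_character_ matches the original function, which targets the all-character data from the question; for mixed types a typed NA per column would be a safer choice.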

[Image: benchmarking plot]

Benchmarking gist:

Here is the gist for the benchmark, in case anyone wants to reproduce the results.

score 15 · Accepted Answer

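rbindlist (and rbind) for data.table has improved functionality and speed with the recent changes/commits in v1.9.3 (development version), and dplyr has a faster version of plyr's rbind.fill, named rbind_all. This answer of mine seems a bit too outdated.

Here is the relevant NEWS entry for rbindlist: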

o  'rbindlist' gains 'use.names' and 'fill' arguments and is now implemented entirely in C. Closes #5249    
  -> use.names by default is FALSE for backwards compatibility (doesn't bind by 
     names by default)
  -> rbind(...) now just calls rbindlist() internally, except that 'use.names' 
     is TRUE by default, for compatibility with base (and backwards compatibility).
  -> fill by default is FALSE. If fill is TRUE, use.names has to be TRUE.
  -> At least one item of the input list has to have non-null column names.
  -> Duplicate columns are bound in the order of occurrence, like base.
  -> Attributes that might exist in individual items would be lost in the bound result.
  -> Columns are coerced to the highest SEXPTYPE, if they are different, if/when possible.
  -> And incredibly fast ;).
  -> Documentation updated in much detail. Closes DR #5158.
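A small sketch of what use.names and fill do, with both arguments passed explicitly so it does not depend on the defaults of any particular data.table version. The tables a, b and c3 are made up for illustration.

```r
library(data.table)

a  <- data.table(x = 1:2, y = c("p", "q"))
b  <- data.table(y = c("r", "s"), x = 3:4)   # same columns, different order
c3 <- data.table(x = 5L, z = TRUE)           # has a column the others lack

# Bind by column name: b's columns are matched to a's by name.
by_name <- rbindlist(list(a, b), use.names = TRUE)

# Bind by position: b's first column (y) lands under a's first column (x),
# so the two types have to be coerced to a common one.
by_pos <- rbindlist(list(a, b), use.names = FALSE)

# fill = TRUE (which needs use.names = TRUE) pads missing columns with NA.
filled <- rbindlist(list(a, c3), use.names = TRUE, fill = TRUE)

by_name
by_pos
filled
```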

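So, I have benchmarked the newer (and faster) versions on relatively bigger data below.

New Benchmark:

We will create a total of 10,000 data.tables, each with between 200 and 300 columns, so that the total number of columns after binding is 500.

Functions to create data: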

require(data.table) ## 1.9.3 commit 1267
require(dplyr)      ## commit 1504 devel
set.seed(1L)
names = paste0("V", 1:500)
foo <- function() {
    cols = sample(200:300, 1)
    data = setDT(lapply(1:cols, function(x) sample(10)))
    setnames(data, sample(names)[1:cols])
}
n = 10e3L
ll = vector("list", n)
for (i in 1:n) {
    .Call("Csetlistelt", ll, i, foo())
}

And here are the timings:

## Updated timings on data.table v1.9.5 - three consecutive runs:
system.time(ans1 <- rbindlist(ll, fill=TRUE))
#   user  system elapsed 
#  1.993   0.106   2.107 
system.time(ans1 <- rbindlist(ll, fill=TRUE))
#   user  system elapsed 
#  1.644   0.092   1.744 
system.time(ans1 <- rbindlist(ll, fill=TRUE))
#   user  system elapsed 
#  1.297   0.088   1.389 


## dplyr's rbind_all - Timings for three consecutive runs
system.time(ans2 <- rbind_all(ll))
#   user  system elapsed  
#  9.525   0.121   9.761 

#   user  system elapsed  
#  9.194   0.112   9.370 

#   user  system elapsed  
#  8.665   0.081   8.780 

identical(ans1, setDT(ans2)) # [1] TRUE
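The timings above are tied to the development versions named in the code. As far as I know, rbind_all is no longer available in current dplyr releases and bind_rows is its replacement, so anyone rerunning the comparison today would probably need something like the sketch below. The input pieces is a toy list, and the dplyr part only runs if dplyr is installed.

```r
library(data.table)

# Toy input: two data.frames with partly different columns.
pieces <- list(
  data.frame(a = 1:2, b = c("x", "y"), stringsAsFactors = FALSE),
  data.frame(a = 3L, c = 2.5)
)

ans_dt <- rbindlist(pieces, fill = TRUE)

# Optional cross-check against dplyr, assuming bind_rows is available.
if (requireNamespace("dplyr", quietly = TRUE)) {
  ans_dplyr <- dplyr::bind_rows(pieces)
  print(all.equal(as.data.frame(ans_dt), as.data.frame(ans_dplyr)))
}
```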
