r - 参照ルックアップ後にdata.tableをそれ自体とマージします

Question

私が持っているdata.tables DT場合neighbors：

set.seed(1)
library(data.table)
DT <- data.table(idx=rep(1:10, each=5), x=rnorm(50), y=letters[1:5], ok=rbinom(50, 1, 0.90))
n <- data.table(y=letters[1:5], y1=letters[c(2:5,1)])

nルックアップテーブルです。いつでも、対応するを検索し、その値をとに使用したいok == 0と思います。例として、DTの行4：y1nxidx

> DT
   idx          x y ok
1:   1 -0.6264538 a  1
2:   1  0.1836433 b  1
3:   1 -0.8356286 c  1
4:   1  1.5952808 d  0
5:   1  0.3295078 e  1
6:   2 -0.8204684 a  1

y1from nfordはe：_

> n[y == 'd']
   y y1
1: d  e

idx行4の場合は1です。したがって、次を使用します。

> DT[idx == 1 & y == 'e', x]
[1] 0.3295078

すべての値を適切なn['y1']値に置き換えて、出力をdata.table同じようにしたいと思います。DT[ok == 0]xx

> output
   idx          x y ok
1:   1  0.3295078 d  0
2:   2 -0.3053884 d  0
3:   3  0.3898432 a  0
4:   5  0.7821363 a  0
5:   7  1.3586800 e  0
6:   8  0.7631757 d  0

plyrベースRまたは...そしておそらく金曜日の遅くにこれを行ういくつかの方法を考えることができます...しかし、これが必要とするマージのシーケンスが何であれ、data.table私を超えています！

score 8 · Accepted Answer

素晴らしい質問です。他の回答の関数を使用し、Blueの回答を関数にラップするとblue、次のようになります。ベンチマークには、setkeyすべての場合のまでの時間が含まれます。

red = function() {
    ans = DT[ok==0]
      # Faster than setkey(DT,ok)[J(0)] if the vector scan is just once
      # If lots of lookups to "ok" need to be done, then setkey may be worth it
      # If DT[,ok:=as.integer(ok)] can be done first, then ok==0L slightly faster

    # After extracting ans in the original order of DT, we can now set the key :
    setkey(DT,idx,y)
    setkey(n,y)

    # Now working with the reduced ans ...

    ans[,y1:=n[y,y1,mult="first"]]
    # Add a new column y1 by reference containing the lookup in n
    # mult="first" because we know n's key is unique, for speed (to save looking
    # for groups of matches in n). Future version of data.table won't need this.
    # Also, mult="first" has the advantage of dropping group columns (so we don't
    # need [[2L]]). mult="first"|"last" turns off by-without-by of mult="all".

    ans[,x:=DT[ans[,list(idx,y1)],x,mult="first"]]
    # Changes the contents of ans$x by reference. The ans[,list(idx,y1)] part is
    # how to pick the columns of ans to join to DT's key when they are not the key
    # columns of ans and not the first 1:n columns of ans. There is no need to key
    # ans, especially since that would change ans's order and not strictly answer
    # the question. If idx and y1 were columns 1 and 2 of (unkeyed) ans then we
    # wouldn't need that part, just
    #    ans[,x:=DT[ans,x,mult="first"]]
    # would do (relying on DT having 2 columns in its key). That has the advantage
    # of not copying the idx and y1 columns into a new data.table to pass as the i
    # DT. To save that copy y1 could be moved to column 2 using setcolorder first.

    redans <<- ans
    }

crdt(1e5)
origDT = copy(DT)
benchmark(blue={DT=copy(origDT); system.time(blue())},
          red={DT=copy(origDT); system.time(red())},
          fun={DT=copy(origDT); system.time(fun(DT,n))},
          replications=3, order="relative")

test replications elapsed relative user.self sys.self user.child sys.child
 red            3   1.107    1.000     1.100    0.004          0         0
blue            3   5.797    5.237     5.660    0.120          0         0
 fun            3   8.255    7.457     8.041    0.184          0         0

crdt(1e6)
[ .. snip .. ]
test replications elapsed relative user.self sys.self user.child sys.child
 red            3  14.647    1.000    14.613    0.000          0         0
blue            3  87.589    5.980    87.197    0.124          0         0
 fun            3 197.243   13.466   195.240    0.644          0         0

identical(blueans[,list(idx,x,y,ok,y1)],redans[order(idx,y1)])
# [1] TRUE

orderで必要なidenticalのはred、と同じ順序で結果を返すのDT[ok==0]に対し、で同点の場合はblueによって順序付けられているように見えるためです。y1idx

結果に不要な場合は、;y1を使用して（テーブルサイズに関係なく）即座に削除できます。ans[,y1:=NULL]つまり、タイミングにまったく影響を与えることなく、問題の要求された正確な結果を生成するために、これを上記に含めることができます。

score 4 · Accepted Answer

library(data.table)

crdt <- function(i=10){
 set.seed(1)
 DT <<- data.table(idx=rep(1:i, each=5), x=rnorm(5*i), 
                   y=letters[1:5], ok=rbinom(5*i, 1, 0.90))
 n <<- data.table(y=letters[1:5], y1=letters[c(2:5,1)])
} 

fun <- function(DT,n){
 setkey(DT,ok)
 n1 <- merge(n,DT[J(0),list(y,idx)],by="y")
 DT[J(0),x:=DT[paste0(y,idx) %in% paste0(n1[,y1],n1[,idx]),x]]
} 

crdt(10)
fun(DT,n)[J(0)]
     ok idx          x y
[1,]  0   1  0.3295078 d
[2,]  0   2 -0.3053884 d
[3,]  0   3  0.3898432 a
[4,]  0   5  0.7821363 a
[5,]  0   7  1.3586796 e
[6,]  0   8  0.7631757 d

しかし、それでもビッグデータの場合はかなり遅いです。

crdt(1e6)
system.time(fun(DT,n)[J(0)])
       User      System     elapsed 
      4.213       0.162       4.374 

crdt(1e7)
system.time(fun(DT,n)[J(0)])
       User      System     elapsed 
    195.685       3.949     199.592

より速い解決策を学ぶことに興味があります。

score 4 · Accepted Answer

超複雑な答え：

setkey(
    setkey(
        setkey(DT,y)[setkey(n,y),nomatch=0] #inner joins DT to n
    #matches the new x value by idx and y, and assigns it
    ,idx,y1)[setkey(J(idx,y,new.x=x),idx,y),x:=new.x] 
,ok)[list(0)] #pulls things where ok == 0

ローランドの答えは小さいテーブルの方が良いように見えますが、私のものは最終的に大きいサイズに追いつきます。でも、あまりチェックはしていません。

> library(rbenchmark)
> benchmark(fun(DT,n)[J(0)],setkey(setkey(setkey(DT,y)[setkey(n,y),nomatch=0],idx,y1)[setkey(J(idx,y,new.x=x),idx,y),x:=new.x],ok)[list(0)])
                                                                                                                                  test
1                                                                                                                     fun(DT, n)[J(0)]
2 setkey(setkey(setkey(DT, y)[setkey(n, y), nomatch = 0], idx, y1)[setkey(J(idx, y, new.x = x), idx, y), `:=`(x, new.x)], ok)[list(0)]
  replications elapsed relative user.self sys.self user.child sys.child
1          100   13.21 1.000000     13.08     0.02         NA        NA
2          100   15.08 1.141559     14.76     0.06         NA        NA
> crdt(1e5)
> benchmark(fun(DT,n)[J(0)],setkey(setkey(setkey(DT,y)[setkey(n,y),nomatch=0],idx,y1)[setkey(J(idx,y,new.x=x),idx,y),x:=new.x],ok)[list(0)])
                                                                                                                                  test
1                                                                                                                     fun(DT, n)[J(0)]
2 setkey(setkey(setkey(DT, y)[setkey(n, y), nomatch = 0], idx, y1)[setkey(J(idx, y, new.x = x), idx, y), `:=`(x, new.x)], ok)[list(0)]
  replications elapsed relative user.self sys.self user.child sys.child
1          100  150.49 1.000000    148.98     0.89         NA        NA
2          100  155.33 1.032162    151.04     2.25         NA        NA
>

r - 参照ルックアップ後にdata.tableをそれ自体とマージします

3 に答える 3

Related

Reference