r - Update within by, and drop

Question

Hello everyone, and happy new year. I was wondering whether data.table can handle updates that are made based on a selection within each group.

R) a=data.table(x=c("a","a","b","b","c","c"),y=c(1,2,3,3,2,1))
R) a
   x y
1: a 1
2: a 2
3: b 3
4: b 3
5: c 2
6: c 1

If I want to update on a condition within each by-group, I have to do the selection in j, even though it is really more of an i thing (a selection).

R) a[,c:=ifelse(y==max(y),"yes","no"),by=x]
R) a
   x y   c
1: a 1  no
2: a 2 yes
3: b 3 yes
4: b 3 yes
5: c 2 yes
6: c 1  no

Could I do the same thing with an option such as a[y==max(y),c:="yes",by=x,within.by=TRUE], which I think would be much faster?

My second question: is data.table scheduled to get a drop argument, as in DT[drop="x,y,z"], instead of DT[,':='(x=NULL,y=NULL,z=NULL)]?

score 1 · Accepted Answer

This is just a guess, following and building on comments: which.max(x) may be faster than x==max(x).

From ?which.max :

Value of which.min and which.max
An integer of length 1 or 0 (iff x has no non-NAs), giving the index of the first minimum or maximum respectively of x. If this extremum is unique (or empty), the results are the same as (but more efficient than) which(x == min(x)) or which(x == max(x)) respectively.

So, maybe something like :

DT[,c:="no"]
w = DT[,list(IDX=.I[which.max(y)]),by=x]$IDX
DT[w,c:="yes"]

That uses i, which might be what you're getting at. The result w holds just one item per group, rather than .N per group, so it might be faster for that reason too, not just because of which.max itself. But of course if the max value can be tied then which.max will only return the first, so it may not be appropriate depending on your data.
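If ties matter, one variation on the same idea is to collect the row numbers of every maximum in each group rather than only the first. The following is a minimal sketch on the question's data, not a benchmarked recommendation. It reuses the names DT and w from the snippet above, and it assumes the unnamed j result comes back in data.table's default column V1.

```r
library(data.table)

# Same data as in the question
DT <- data.table(x = c("a", "a", "b", "b", "c", "c"), y = c(1, 2, 3, 3, 2, 1))

# which.max() returns only the first maximum; which(v == max(v)) returns all of them
which.max(c(3, 3))              # 1
which(c(3, 3) == max(c(3, 3)))  # 1 2

DT[, c := "no"]
# .I holds row numbers in the full table; keep every row that ties for its group max
w <- DT[, .I[y == max(y)], by = x]$V1
DT[w, c := "yes"]
DT
```

Here both rows of group b end up as "yes", which matches the ifelse result in the question. The comparison y == max(y) is still evaluated over every row of each group, so this variant gives up part of the possible speed advantage of which.max.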

If you benchmark, be sure to make the data large (1GB+), and compare keyed by to unkeyed by as well.
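A rough timing harness along those lines might look like the sketch below. The table size N, the number of groups, and the column names c1 and c2 are assumptions chosen so that it runs quickly; for a meaningful comparison N would need to be scaled up considerably.

```r
library(data.table)

set.seed(1)
N <- 1e6  # assumption: small enough to run quickly; increase for a real benchmark
DT <- data.table(x = sample(1e4, N, replace = TRUE), y = runif(N))

# Approach from the question: ifelse() evaluated over every row of every group
t_j <- system.time(DT[, c1 := ifelse(y == max(y), "yes", "no"), by = x])

# Approach from this answer: find one row per group, then update by row number in i
t_i <- system.time({
  DT[, c2 := "no"]
  w <- DT[, list(IDX = .I[which.max(y)]), by = x]$IDX
  DT[w, c2 := "yes"]
})

# Repeat the second approach on a keyed copy to compare keyed and unkeyed by
DTk <- copy(DT)
setkey(DTk, x)
t_k <- system.time({
  DTk[, c2 := "no"]
  wk <- DTk[, list(IDX = .I[which.max(y)]), by = x]$IDX
  DTk[wk, c2 := "yes"]
})

# Elapsed seconds for each variant
c(j_ifelse = t_j[["elapsed"]], i_unkeyed = t_i[["elapsed"]], i_keyed = t_k[["elapsed"]])
```

Because y is drawn with runif, ties within a group are practically impossible, so both approaches should mark the same rows. With data that has tied maxima the two columns can differ.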
