r - R: Get unique records in a data frame based on a secondary field condition

Question

Updated and simplified

I have a very large table (~7 million records) with the following structure.

temp <- read.table(header = TRUE, stringsAsFactors=FALSE,
                   text = "Website Datetime    Rating
A 2007-12-06T14:53:07Z        1
A 2006-07-28T03:52:26Z        4
B 2006-11-02T11:06:25Z        2
C 2007-06-19T06:56:08Z        5
C 2009-11-28T22:27:58Z        2
C 2009-11-28T22:28:13Z        2")

What I would like to get is one row per website, showing that website's maximum rating:

Website    Rating
A    4
B    2
C    5

I tried using a for loop, but it was too slow. Is there another way to achieve this?

score 3 · Accepted Answer

 do.call( rbind, lapply( split(temp, temp$Website) , 
                               function(d) d[ which.max(d$Rating), ] ) )
  Website             Datetime Rating
A       A 2006-07-28T03:52:26Z      4
B       B 2006-11-02T11:06:25Z      2
C       C 2007-06-19T06:56:08Z      5

The 'Datetime' variable does not yet look like a real Date or date-time object (it is read in as character), so you should convert it to one first.
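A minimal sketch of that conversion, assuming every timestamp is in UTC and uses the ISO 8601 layout shown in the question. The `tz = "UTC"` setting is an assumption based on the trailing `Z`, not something the question states.

```r
temp <- read.table(header = TRUE, stringsAsFactors = FALSE,
                   text = "Website Datetime    Rating
A 2007-12-06T14:53:07Z        1
A 2006-07-28T03:52:26Z        4
B 2006-11-02T11:06:25Z        2")

# Parse the character timestamps; the literal "T" and "Z" are matched
# as plain characters by the format string.
temp$Datetime <- as.POSIXct(temp$Datetime,
                            format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
str(temp$Datetime)
```

Rows whose text does not match the format come back as `NA`, so checking `anyNA(temp$Datetime)` after the conversion is a cheap sanity test on a large table.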

which.max picks the first item that is a maximum:

>  which.max(c(1,1,2,2))
[1] 3

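So Ananda's warning on that point may not be correct. The data.table methods will certainly be faster, and they may also succeed on a machine with only modest memory. The method above may make several copies of the data along the way, whereas the data.table functions do not need to copy as much.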
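If staying in base R, one alternative that avoids splitting the data frame into per-group pieces is to sort once and keep the first row of each group. This is a different technique from the split/lapply code above, sketched here on the question's sample data. Whether it is fast enough for ~7 million rows is untested.

```r
temp <- read.table(header = TRUE, stringsAsFactors = FALSE,
                   text = "Website Datetime    Rating
A 2007-12-06T14:53:07Z        1
A 2006-07-28T03:52:26Z        4
B 2006-11-02T11:06:25Z        2
C 2007-06-19T06:56:08Z        5
C 2009-11-28T22:27:58Z        2
C 2009-11-28T22:28:13Z        2")

# Sort by Website, then by Rating in descending order.
sorted <- temp[order(temp$Website, -temp$Rating), ]

# duplicated() flags every row after the first within each Website,
# so negating it keeps only the top-rated row per group.
best <- sorted[!duplicated(sorted$Website), ]
best
```

Because `order()` leaves unresolved ties in their original order, a tied maximum resolves to the earliest row, which matches the `which.max` behaviour described above.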

score 2 · Accepted Answer

I would probably explore the data.table package, though without more details, the following example solution is most likely not going to be what you need. I mention this because, in particular, there might be more than one "Rating" record per group which matches max; how would you like to deal with those cases?

library(data.table)
temp <- read.table(header = TRUE, stringsAsFactors=FALSE,
                text = "Website Datetime    Rating
                        A       2012-10-9   10
                        A       2012-11-10  12
                        B       2011-10-9   5")
DT <- data.table(temp, key="Website")
DT
#    Website   Datetime Rating
# 1:       A  2012-10-9     10
# 2:       A 2012-11-10     12
# 3:       B  2011-10-9      5
DT[, list(Datetime = Datetime[which.max(Rating)], 
          Rating = max(Rating)), by = key(DT)]
#    Website   Datetime Rating
# 1:       A 2012-11-10     12
# 2:       B  2011-10-9      5
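When there are more columns than just Datetime, listing each one by hand gets tedious. A possible variation, not part of the original answer, is to subset `.SD` (data.table's per-group subset of the data) with `which.max`, which returns the whole winning row. The sketch below uses the question's sample data.

```r
library(data.table)

temp <- read.table(header = TRUE, stringsAsFactors = FALSE,
                   text = "Website Datetime    Rating
A 2007-12-06T14:53:07Z        1
A 2006-07-28T03:52:26Z        4
B 2006-11-02T11:06:25Z        2
C 2007-06-19T06:56:08Z        5
C 2009-11-28T22:27:58Z        2
C 2009-11-28T22:28:13Z        2")

DT <- as.data.table(temp)

# For each Website, keep the full row with the (first) maximum Rating.
res <- DT[, .SD[which.max(Rating)], by = Website]
res
```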

I would recommend that to get better answers, you might want to include information like how your datetime variable might factor into your aggregation, or whether it is possible for there to be more than one "max" value per group.

If you want all the rows that match the max, the fix is easy:

DT[, list(Datetime = Datetime[Rating == max(Rating)], 
          Rating = max(Rating)), by = key(DT)]

If you do just want the Rating column, there are many ways to go about this. Following the same steps as above to convert to a data.table, try:

DT[, list(Rating = max(Rating)), by = key(DT)]
#    Website Rating
# 1:       A      4
# 2:       B      2
# 3:       C      5

Or, keeping the original "temp" data.frame, try aggregate():

aggregate(Rating ~ Website, temp, max)
#   Website Rating
# 1       A      4
# 2       B      2
# 3       C      5

Yet another approach, using ave:

temp[with(temp, Rating == ave(Rating, Website, FUN=max)), ]
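Note that the `ave` version keeps every row tied for the maximum, while the `which.max` approach keeps only the first. A small sketch of the difference, using a hypothetical data frame `ties` that is not part of the question's data:

```r
# Hypothetical data: website A has two rows tied at the maximum rating.
ties <- data.frame(Website = c("A", "A", "B"),
                   Rating  = c(3, 3, 1),
                   stringsAsFactors = FALSE)

# ave() broadcasts each group's max back to every row, so all ties match.
all_max <- ties[with(ties, Rating == ave(Rating, Website, FUN = max)), ]

# which.max() returns a single index per group: the first maximum.
first_max <- do.call(rbind, lapply(split(ties, ties$Website),
                                   function(d) d[which.max(d$Rating), ]))

nrow(all_max)
nrow(first_max)
```

Which behaviour is wanted depends on how duplicates of the maximum should be treated, which is the question raised earlier in this answer.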
