r - 特定の条件下で最新のエントリを抽出する

Question

私は次のことを成し遂げるのに苦労しています：

データセットの例:

   belongID   uniqID   Time   Rating  
   1           101       5      0  
   1           102       4      0  
   2           103       4      0  
   2           104       3      0  
   2           105       2      5
   3           106       4      2  
   3           107       5      0  
   3           108       5      1

問題は、この評価が 0 でない限り、所属 ID ごとに最新のエントリ (時間の最大値) を抽出したいということです。ただし、最新のエントリの評価が 0 の場合。評価のある最初のエントリが必要です (最高の評価ではなく、評価がゼロではない最初の値のみ)。他のすべてのエントリもゼロの場合は、最新のものを選択する必要があります。

最終結果は次のようになります。

   belongID   uniqID   Time   Rating  
   1           101       5      0  
   2           105       2      5
   3           108       5      1

データセットはかなり大きく、所属 ID で並べられています。時間順に並べられていないため、同じ所属 ID を持つ古いエントリの後に新しいエントリが来る場合があります。

「0 評価」制約を使用せずに、次の関数を使用して最新のエントリを計算しました。

>uniqueMax <- function(m, belongID = 1, time = 3) {
    t(
      vapply(
         split(1:nrow(m), m[,belongID]), 
         function(i, x, time) x[i, , drop=FALSE][which.max(x[i,time]),], m[1,], x=m, time=time
      )
    )
 }

「0 評価」制約を組み込む方法がわかりません。

編集：フォローアップの質問：

getRatingゼロを評価するだけでなく、より多くの評価を考慮する必要がある場合 (たとえば、0、1、4、および 5) 、関数をどのように変更する必要があるかを誰かが知っていますか? したがって、レーティング 0 または 1 または 4 または 5 でない限り、最新のものに割り当てますか? 評価が 0、1、4、5 の場合、別の評価を持つ最新のエントリに割り当てます。すべての評価が 0、1、4、または 5 の場合、最新のものに割り当てます。次のことを試しましたが、うまくいきませんでした：

getRating <- function(x){
  iszero <- x$Rating == 0 | x$Rating == 1 | x$Rating == 4 | x$Rating ==5
  if(all(iszero)){
    id <- which.max(x$Time)
  } else {
    id <- which.max((!iszero)*x$Time) 
            # This trick guarantees taking 0 into account
  }
  x[id,]
}
# Do this over the complete data frame
do.call(rbind,lapply(split(Data,Data$belongID),getRating)) 
     # edited per Tyler's suggestion'

score 4 · Accepted Answer

これが私のクラックです（興味深い問題）：

データを読み込む:

m <- read.table(text="belongID   uniqID   Time   Rating  
   1           101       5      0  
   1           102       4      0  
   2           103       4      0  
   2           104       3      0  
   2           105       2      5
   3           106       4      2  
   3           107       5      0  
   3           108       5      1 ", header=T)

要求した行を抽出する:

m2 <- m[order(m$belongID, -m$Time), ]                 #Order to get max time first
LIST <- split(m2, m$belongID)                         #split by belongID
FUN <- function(x) which(cumsum(x[, 'Rating'])!=0)[1] #find first non zero Rating
LIST2 <- lapply(LIST, function(x){                    #apply FUN; if NA do 1st row
        if (is.na(FUN(x))) {
            x[1, ]
        } else {
            x[FUN(x), ]
        }
    }
)
do.call('rbind', LIST2)                              #put it all back together

どちらが得られますか:

  belongID uniqID Time Rating
1        1    101    5      0
2        2    105    2      5
3        3    108    5      1

編集非常に多くの人々がこの問題に答えているため(IMHOを解決するのは楽しい)、マイクロベンチマークテスト(Windows 7)を求めました:

Unit: milliseconds
    expr       min        lq    median        uq      max
1   JIGR  6.356293  6.656752  7.024161  8.697213 179.0884
2 JORRIS  2.932741  3.031416  3.153420  3.552554 246.9604
3  PETER 10.851046 11.459896 12.358939 17.164881 216.7284
4  TYLER  2.864625  2.961667  3.066174  3.413289 221.1569

そしてグラフ：

ここに画像の説明を入力

score 3 · Accepted Answer

これは、data.tableを使用して、フィルタリングと関数の実行を簡単にするためのソリューションgetRecentRowですbelongID。

library(data.table)

# Load the data from the example.
dat = structure(list(belongID = c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), 
          uniqID = 101:108, Time = c(5L, 4L, 4L, 3L, 2L, 4L, 5L, 5L),
          Rating = c(0L, 0L, 0L, 0L, 5L, 2L, 0L, 1L)), 
          .Names = c("belongID", "uniqID", "Time", "Rating"),
          row.names = c(NA, -8L), class = c("data.table", "data.frame"))

dat = data.table(dat) # Convert to data table.

# Function to get the row for a given belongID
getRecentRow <- function(data) {
    # Filter by Rating, then order by time, then select first.
    row = data[Rating != 0][order(-Time)][1]

    if(!is.na(row$uniqID)) {
        # A row was found with Rating != 0, return it.
        return(row)
     } else {
          # The row was blank, so filter again without restricting. rating.
          return(data[order(-Time)][1])
        }  
}

# Run getRecentRow on each chunk of dat with a given belongID
result = dat[,getRecentRow(.SD), by=belongID]

     belongID uniqID Time Rating
[1,]        1    101    5      0
[2,]        2    105    2      5
[3,]        3    108    5      1

score 3 · Accepted Answer

1つの提案は次のとおりです。

library(plyr)

maxV <- function(b) {
    if (b[which.max(b$Time), "Rating"]  != 0) {
        return(b[which.max(b$Time), ])
    } else if (!all(b$Rating==0)) {
        bb <- b[order(b$Rating), ]
        return(bb[bb$Rating != 0,][1, ])
    } else {
        return(b[which.max(b$Time),])
    }
}

a <- read.table(textConnection(" belongID   uniqID   Time   Rating  
   1           101       5      0  
   1           102       4      0  
   2           103       4      0  
   2           104       3      0  
   2           105       2      5
   3           106       4      2  
   3           107       5      0  
   3           108       5      1 "), header=T)

ddply(a, .(belongID), maxV)
  belongID uniqID Time Rating
1        1    101    5      0
2        2    105    2      5
3        3    108    5      1

score 3 · Accepted Answer

編集：

速度が主な関心事であるため、私のトリックを最初のソリューションに編集すると、次のようになります。

uniqueMax <- function(m, belongID = 1, time = 3) {
  t(
    vapply(
      split(1:nrow(m), m[,belongID]), 
      function(i, x, time){ 
        is.zero <- x[i,'Rating'] == 0
        if(all(is.zero)) is.zero <- FALSE
        x[i, , drop=FALSE][which.max(x[i,time]*(!is.zero)),]
      }
      , m[1,], x=m, time=time
      )
    )
}

以前のものよりも少し読みやすい私の元のソリューション：

# Get the rating per belongID
getRating <- function(x){
  iszero <- x$Rating == 0
  if(all(iszero)){
    id <- which.max(x$Time)
  } else {
    id <- which.max((!iszero)*x$Time) 
            # This trick guarantees taking 0 into account
  }
  x[id,]
}
# Do this over the complete data frame
do.call(rbind,lapply(split(Data,Data$belongID),getRating)) 
     # edited per Tyler's suggestion

結果：

tc <- textConnection('
belongID   uniqID   Time   Rating  
   1           101       5      0  
   1           102       4      0  
   2           103       4      0  
   2           104       3      0  
   2           105       2      5
   3           106       4      2  
   3           107       5      0  
   3           108       5      1 ')

Data <- read.table(tc,header=TRUE)

do.call(rbind,lapply(split(Data,Data$belongID),getRating))

与える：

  belongID uniqID Time Rating
1        1    101    5      0
2        2    105    2      5
3        3    108    5      1

編集: 楽しみのために、rbenchmark1000回のレプリケーションを持つ小さなデータセットと10回のレプリケーションを持つ大きなデータセットで(を使用して)ベンチマークも行いました:

結果：

> benchmark(Joris(Data),Tyler(Data),uniqueMax(Data),
+           columns=c("test","elapsed","relative"),
+           replications=1000)
             test elapsed relative
1     Joris(Data)    1.20 1.025641
2     Tyler(Data)    1.42 1.213675
3 uniqueMax(Data)    1.17 1.000000

> benchmark(Joris(Data2),Tyler(Data2),uniqueMax(Data2),
+           columns=c("test","elapsed","relative"),
+           replications=10)
              test elapsed relative
1     Joris(Data2)    3.63 1.174757
2     Tyler(Data2)    4.02 1.300971
3 uniqueMax(Data2)    3.09 1.000000

ここでは、関数 Joris() と Tyler() をソリューションにラップし、次のように Data2 を作成しました。

Data2 <- data.frame(
  belongID = rep(1:1000,each=10),
  uniqID = 1:10000,
  Time = sample(1:5,10000,TRUE),
  Rating = sample(0:5,10000,TRUE)
  )

r - 特定の条件下で最新のエントリを抽出する

4 に答える 4

Related

Reference