r - Create a lag using adjacent columns of a data.frame

Question

I have a large dataset that looks like this:

set.seed(1234)
id <- c(3,3,3,5,5,7)
amount <- c(24,48,60,84,96,175)
start <- as.Date(c("2006-01-01","2009-12-09","2010-01-01","2006-04-24", "2009-12-09","2009-05-01"))
end <- as.Date(c("2010-01-01","2010-01-01","2010-01-01","2009-12-09","2009-12-09", "2009-05-01"))               
noise <-rnorm(6)
test <- data.frame(id,amount,start,end,noise)            

  id amount      start        end      noise
   3     24 2006-01-01 2010-01-01  0.4978505
   3     48 2009-12-09 2010-01-01 -1.9666172
   3     60 2010-01-01 2010-01-01  0.7013559
   5     84 2006-04-24 2009-12-09 -0.4727914
   5     96 2009-12-09 2009-12-09 -1.0678237
   7    175 2009-05-01 2009-05-01 -0.2179749

However, I need it to look like this:

  id amount      start        end      noise   switch
   3     24 2006-01-01 2009-12-09  0.4978505        0
   3     48 2009-12-09 2010-01-01 -1.9666172        1
   3     60 2010-01-01 2010-01-01  0.7013559        2
   5     84 2006-04-24 2009-12-09 -0.4727914        0 
   5     96 2009-12-09 2009-12-09 -1.0678237        1
   7    175 2009-05-01 2009-05-01 -0.2179749        0

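That is, I want to lag the values of start by id and replace the values of end with them. Second, I want to create a new variable called switch that counts how many times amount has changed within an id, with the initial condition that the first observation == 0. I tried using ts() to create the lag; it produces a ts object rather than a Date, but in principle it does what I need.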

       out <- cbind(as.ts(test$start),lag(test$start))
       colnames(out) <- c("start","end")
       cbind(as.ts(test$start),lag(test$start))

         as.ts(test$start) lag(test$start)
            NA           13149
          13149           14587
          14587           14610
          14610           13262
          13262           14587
          14587           14365
          14365              NA
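
For reference, a minimal sketch of why the two columns above come out offset by one row: stats::lag() leaves the values untouched and only shifts the time base of the series, and cbind() then lines the two series up on time, which is where the NA rows come from. The toy vector x is made up for illustration and is not part of the dataset.

```r
# Toy series, made up for illustration.
x <- ts(c(10, 20, 30))
lagged <- stats::lag(x, k = 1)

# The values are identical; only the time base has moved back one step.
same_values <- identical(as.numeric(x), as.numeric(lagged))
tsp_x <- tsp(x)            # start, end, frequency of the original series
tsp_lagged <- tsp(lagged)  # start and end are each one less than for x

# cbind() aligns both series on time, so the lagged column
# appears one row earlier and each column gets one NA.
aligned <- cbind(x, lagged)
```

Because the shift lives in the time attribute, it is lost as soon as the result is treated as a plain vector, and it knows nothing about id groups.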

So the lag(test$start) column is what my end column should look like, but it needs to be applied by the id variable. So I tried to turn it into a function and apply it by id.

        #make it a function 
        lagfun <- function(x){
          cbind(as.ts(x),lag(x))
        }

        y <- unlist(tapply(start,id,lagfun))

And this is where things get really ugly. Is there a better way to do this?

score 5 · Accepted Answer

If you put your time series into a data.table, you can do this in one line:

testDT[ , c("end", "switch") := 
          list( c(tail(start, -1), tail(end, 1)), cumsum(c(0, diff(amount) != 0)))
      , by=id]

Here it is broken down:

# create your data.table object 
library(data.table)
testDT <- data.table(test)


# Modify `end`: shift `start` up by one row (drop its first value) and
#   append the group's final `end` date. Do this `by=id`.
testDT[, end := c(tail(start, -1), tail(end, 1)), by=id]

# Count the number of times that each amount differs from the 
#  previous amount value.  
# Start this vector at 0, and take the cumulative sum. 
#  Also do this by id. 
testDT[, switch := cumsum(c(0, diff(amount) != 0)), by=id]

# this is the final result. 
testDT
   id amount      start        end      noise switch
1:  3     24 2006-01-01 2009-12-09 -1.2070657      0
2:  3     48 2009-12-09 2010-01-01  0.2774292      1
3:  3     60 2010-01-01 2010-01-01  1.0844412      2
4:  5     84 2006-04-24 2009-12-09 -2.3456977      0
5:  5     96 2009-12-09 2009-12-09  0.4291247      1
6:  7    175 2009-05-01 2009-05-01  0.5060559      0
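
If you would rather stay in base R, here is a rough sketch of the same two steps. It assumes the rows are already sorted so that each id forms one contiguous block; is_last and idx are just helper names made up for illustration, and the noise column is left out because it plays no role.

```r
# Same example data as in the question (the noise column is omitted here).
test <- data.frame(
  id     = c(3, 3, 3, 5, 5, 7),
  amount = c(24, 48, 60, 84, 96, 175),
  start  = as.Date(c("2006-01-01", "2009-12-09", "2010-01-01",
                     "2006-04-24", "2009-12-09", "2009-05-01")),
  end    = as.Date(c("2010-01-01", "2010-01-01", "2010-01-01",
                     "2009-12-09", "2009-12-09", "2009-05-01"))
)

# Assumption: rows are sorted so that each id forms one contiguous block.
n <- nrow(test)
is_last <- c(test$id[-1] != test$id[-n], TRUE)  # TRUE on the last row of each id

# For every row that is not the last of its id, end becomes the next row's start.
# The last row of each id keeps its original end date.
idx <- which(!is_last)
test$end[idx] <- test$start[idx + 1]

# Running count of amount changes within each id, starting at 0.
test$switch <- ave(test$amount, test$id,
                   FUN = function(a) cumsum(c(0, diff(a) != 0)))
```

Plain indexing is used instead of ifelse() because ifelse() would drop the Date class from end.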
