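I have a dataset of customer transaction data, with a timestamp for when each event occurred. I want to keep only the events that occurred before a particular event. The cutoff is not one fixed point in time; it has to be determined separately for each customer.
Here is a snapshot: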
      custId         date_time_recorded              event           dateTime1
 1 280512544 2012-11-13 15:25:37.947-08            shipped 2012-11-13 15:25:37
 2 280512544 2012-11-13 15:22:42.614-08        statusCheck 2012-11-13 15:22:42
 3 280512544  2012-11-13 15:03:16.62-08        statusCheck 2012-11-13 15:03:16
 4 280512544 2012-11-13 15:01:35.149-08        statusCheck 2012-11-13 15:01:35
 5 280512544 2012-11-13 14:45:41.964-08      status-picked 2012-11-13 14:45:41
 6 280512544 2012-11-13 14:44:57.664-08 warehouse_notified 2012-11-13 14:44:57
 7 280512544 2012-11-13 14:44:57.644-08        statusCheck 2012-11-13 14:44:57
 8 280512544 2012-11-13 13:05:15.725-08      recordCreated 2012-11-13 13:05:15
 9 280510610 2012-11-13 09:22:36.427-08            shipped 2012-11-13 09:22:36
10 280510610 2012-11-13 09:20:07.202-08        statusCheck 2012-11-13 09:20:07
11 280510610 2012-11-13 09:14:56.182-08        statusCheck 2012-11-13 09:14:56
I want to keep only the events that occurred before the "shipped" event for each customer. I am currently using ddply to do this, but it takes a long time.
library(plyr)      # ddply
library(fasttime)  # fastPOSIXct

# Keep only the rows of one customer's events that precede its "shipped" event.
keepPreShip <- function(x) {
  shipTime <- fastPOSIXct(x[grep("shipped", x$event, ignore.case = TRUE), "date_time_recorded"], tz = "UTC")
  # shipTime <- fastPOSIXct(x[x$event == "shipped", "date_time_recorded"], tz = "UTC")
  x[x$dateTime1 < shipTime, ]
}
system.time(eventsMain1 <- ddply(ss1, .(custId), keepPreShip))
Is there a faster way to do this, maybe with data.table?
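To make the goal concrete, here is a rough sketch of the kind of data.table approach I have in mind. It is not benchmarked. The small data frame is a stand-in for ss1, and the names dt, shipTime and preShip are made up for illustration. It assumes each customer has at most one "shipped" event that matters (the first one found is used); customers with no "shipped" event end up with an NA cutoff and are dropped.

```r
library(data.table)

# Hypothetical stand-in for ss1 with the same relevant columns.
ss1 <- data.frame(
  custId = c(1L, 1L, 1L, 2L, 2L),
  event = c("shipped", "statusCheck", "recordCreated", "shipped", "statusCheck"),
  dateTime1 = as.POSIXct(
    c("2012-11-13 15:25:37", "2012-11-13 15:22:42", "2012-11-13 13:05:15",
      "2012-11-13 09:22:36", "2012-11-13 09:20:07"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

dt <- as.data.table(ss1)

# Attach each customer's ship time to all of that customer's rows.
# [1L] takes the first "shipped" row, or NA if the customer has none.
dt[, shipTime := dateTime1[tolower(event) == "shipped"][1L], by = custId]

# One vectorised filter over the whole table; rows where the comparison
# is NA are dropped. The helper column is removed from the result only.
preShip <- dt[dateTime1 < shipTime][, shipTime := NULL]
```

The idea is to compute the per-customer cutoff once as a grouped column and then filter in a single pass, instead of subsetting a data frame per customer.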
The dput of the data is as follows:
> dput(ss1)
structure(list(custId = c(280512544L, 280512544L, 280512544L,
280512544L, 280512544L, 280512544L, 280512544L, 280512544L, 280510610L,
280510610L, 280510610L, 280510610L, 280510610L, 280510610L, 280510610L,
280510610L, 280511123L, 280511123L, 280511123L, 280511123L),
date_time_recorded = c("2012-11-13 15:25:37.947-08", "2012-11-13 15:22:42.614-08",
"2012-11-13 15:03:16.62-08", "2012-11-13 15:01:35.149-08",
"2012-11-13 14:45:41.964-08", "2012-11-13 14:44:57.664-08",
"2012-11-13 14:44:57.644-08", "2012-11-13 13:05:15.725-08",
"2012-11-13 09:22:36.427-08", "2012-11-13 09:20:07.202-08",
"2012-11-13 09:14:56.182-08", "2012-11-13 09:11:40.438-08",
"2012-11-13 09:03:51.571-08", "2012-11-13 09:03:51.461-08",
"2012-11-13 09:03:49.174-08", "2012-11-13 06:42:10.208-08",
"2012-11-13 13:51:05.039-08", "2012-11-13 13:13:16.452-08",
"2012-11-13 12:42:08.917-08", "2012-11-13 12:28:51.541-08"
), event = c("shipped", "statusCheck", "statusCheck", "statusCheck",
"status-picked", "warehouse_notified", "statusCheck", "recordCreated",
"shipped", "statusCheck", "statusCheck", "statusCheck", "status-picked",
"warehouse_notified", "statusCheck", "recordCreated", "shipped",
"statusCheck", "statusCheck", "statusCheck"), dateTime1 = structure(c(1352820337.947,
1352820162.614, 1352818996.62, 1352818895.149, 1352817941.964,
1352817897.664, 1352817897.644, 1352811915.725, 1352798556.427,
1352798407.202, 1352798096.182, 1352797900.438, 1352797431.571,
1352797431.461, 1352797429.174, 1352788930.208, 1352814665.039,
1352812396.452, 1352810528.917, 1352809731.541), class = c("POSIXct",
"POSIXt"), tzone = "UTC")), .Names = c("custId", "date_time_recorded",
"event", "dateTime1"), row.names = c(NA, 20L), class = "data.frame")