r - Optimizing lookups in a time series data frame

Question

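I have a data frame in R with 50 columns and 2.5 million rows that represents a time series. The time column is of class POSIXct. For my analysis, I repeatedly need to find the state of the system for a given class at a given time.

My current approach is as follows (simplified and reproducible):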

set.seed(1)
N <- 10000
.time <- sort(sample(1:(100*N),N))
class(.time) <- c("POSIXct", "POSIXt")
df <- data.frame(
  time=.time,
  distance1=sort(sample(1:(100*N),N)),
  distance2=sort(sample(1:(100*N),N)),
  letter=sample(letters,N,replace=TRUE)
)

# state search function
time.state <- function(df,searchtime,searchclass){
  # find all rows in between the searchtime and a while (here 10k seconds)
  # before that
  rows <- which(findInterval(df$time,c(searchtime-10000,searchtime))==1)
  # find the latest state of the given class within the search interval
  return(rev(rows)[match(TRUE,rev(df[rows,"letter"]==searchclass))])
}  

# evaluate the function to retrieve the latest known state of the system
# at time 500,000.
df[time.state(df,500000,"a"),]

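However, the call to which is very expensive. Alternatively, I could filter by class first and then find the time, but that does not change the evaluation time much. According to Rprof, which and == take up most of the time.

Is there a more efficient solution? The time points are sorted in weakly ascending order.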

score 1 · Accepted Answer

Because which, == and [ all scale linearly with the size of the data frame, the solution is to generate subset data frames for bulk operations, as follows:

# function that applies time.state to a series of time/class combinations
time.states <- function(df,times,classes,day.length=24){
  result <- vector("list",length(times))
  day.end <- 0
  for(i in seq_along(times)){
    if(times[i] > day.end){
      # create subset interval from 1h before to day.length hours after
      day.begin <- times[i]-60*60
      day.end <- times[i]+day.length*60*60
      df.subset <- df[findInterval(df$time,c(day.begin,day.end))==1,]
    }
    # save the resulting row from data frame
    result[[i]] <- df.subset[time.state(df.subset,times[i],classes[i]),]
  }
  return(do.call("rbind",result))
}

With dT=diff(range(df$time)) and dT/day.length large, this reduces the evaluation time by a factor of dT/(day.length+1).
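
Since the question states that the time column is sorted, another option is to locate the search window by binary search on time instead of comparing every row. The sketch below is not part of the original answer: time.state.sorted and its lookback argument are hypothetical names, and the data frame is a reduced version of the question's example. The lookup uses findInterval with left.open=TRUE (available since R 3.3.0), so it selects the same half-open window as the original time.state. Note that findInterval still checks that its vector is sorted, so whether this is actually faster on 2.5 million rows would need profiling.

```r
# Reduced version of the example data from the question.
set.seed(1)
N <- 10000
.time <- sort(sample(1:(100 * N), N))
class(.time) <- c("POSIXct", "POSIXt")
df <- data.frame(
  time = .time,
  letter = sample(letters, N, replace = TRUE)
)

# Original state search function from the question, kept for comparison.
time.state <- function(df, searchtime, searchclass) {
  rows <- which(findInterval(df$time, c(searchtime - 10000, searchtime)) == 1)
  rev(rows)[match(TRUE, rev(df[rows, "letter"] == searchclass))]
}

# Hypothetical alternative: assumes df$time is sorted in ascending order.
time.state.sorted <- function(df, searchtime, searchclass, lookback = 10000) {
  tm <- as.numeric(df$time)  # for repeated calls, compute this once outside
  # With left.open = TRUE, findInterval returns the number of elements
  # strictly below the query value.
  first <- findInterval(searchtime - lookback, tm, left.open = TRUE) + 1
  last <- findInterval(searchtime, tm, left.open = TRUE)
  if (first > last) return(NA_integer_)  # no rows inside the window
  idx <- first:last
  # Only the rows inside the window are compared against the class.
  hits <- which(df$letter[idx] == searchclass)
  if (length(hits) == 0) return(NA_integer_)  # class not seen in the window
  idx[max(hits)]
}

# Both calls should select the same row of df.
df[time.state(df, 500000, "a"), ]
df[time.state.sorted(df, 500000, "a"), ]
```

Only the rows inside the window are touched by == and which, so the per-call comparison work depends on the window size rather than on the full data frame.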
