r - Bucketing durations

Question

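I have a log of events, each with a start time, an end time, a category ID, and a count. The log covers several months.

I would like to aggregate the events over time so that I can track histograms for a given day, week, or month. I think the best way to do this is to bin the durations into buckets; 5 minutes should be fine.

For example, if an event starts at 1:01 PM and ends at 1:07 PM, it covers two 5-minute periods (0-5 and 5-10), so I would like to get two records, with the rest of the original data (category and count) duplicated in those new records.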
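To make the 1:01 PM to 1:07 PM example concrete, here is a minimal sketch in base R. The date, the UTC time zone, and the variable names are assumptions chosen only for illustration. It floors both timestamps to a 5-minute boundary and lists every bucket start in between.

```r
# Hypothetical event: 13:01 to 13:07 on an arbitrary date, in UTC (assumption).
bucket_secs <- 5 * 60
start <- as.POSIXct("2012-11-17 13:01:00", tz = "UTC")
end   <- as.POSIXct("2012-11-17 13:07:00", tz = "UTC")

# Floor both timestamps to the bucket boundary at or before them.
first_bucket <- floor(as.numeric(start) / bucket_secs) * bucket_secs
last_bucket  <- floor(as.numeric(end)   / bucket_secs) * bucket_secs

# One bucket start per 5-minute step from the first to the last bucket.
bucket_starts <- as.POSIXct(seq(first_bucket, last_bucket, by = bucket_secs),
                            origin = "1970-01-01", tz = "UTC")
format(bucket_starts, "%H:%M")
```

The event touches the buckets starting at 13:00 and 13:05, so it would yield two records.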

If the input log (x) is:

start / end / catid / count     
2012-11-17 15:05:02.0,  2012-11-17 15:12:52.0,  1, 2    
2012-11-17 15:07:13.0,  2012-11-17 15:17:47.0,  2, 10   
2012-11-17 15:11:00.0,  2012-11-17 15:12:33.0,  3, 5    
2012-11-17 15:12:01.0,  2012-11-17 15:20:00.0,  4, 1

I am trying to bucket it into 5-minute slots (b), like this:

start / catid / count   
2012-11-17 15:05:00.0   1, 2    
2012-11-17 15:10:00.0   1, 2

2012-11-17 15:05:00.0   2, 10   
2012-11-17 15:10:00.0   2, 10
2012-11-17 15:15:00.0   2, 10

2012-11-17 15:10:00.0   3, 5

2012-11-17 15:10:00.0   4, 1
2012-11-17 15:15:00.0   4, 1

Then I could easily aggregate the new data frame (b) by category ID for whatever period I need (hour, day, week, month).
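As a rough sketch of that aggregation step, the bucketed rows can be labelled with a period and summarised with `aggregate()`. The data frame below reproduces the example output; the column name `period`, the UTC time zone, and the choice of `sum` as the summary function are assumptions, and another summary could be swapped in.

```r
# Bucketed data frame b, as in the example output above (UTC is an assumption).
b <- data.frame(
  start = as.POSIXct(c("2012-11-17 15:05:00", "2012-11-17 15:10:00",
                       "2012-11-17 15:05:00", "2012-11-17 15:10:00",
                       "2012-11-17 15:15:00", "2012-11-17 15:10:00",
                       "2012-11-17 15:10:00", "2012-11-17 15:15:00"),
                     tz = "UTC"),
  catid = c(1, 1, 2, 2, 2, 3, 4, 4),
  count = c(2, 2, 10, 10, 10, 5, 1, 1)
)

# Label each row with the period it falls in (here: the hour).
b$period <- format(b$start, "%Y-%m-%d %H:00")

# Summarise the counts per category and period (sum is an assumed choice).
hourly <- aggregate(count ~ catid + period, data = b, FUN = sum)
hourly
```

Changing the format string (for example to `"%Y-%m-%d"` for days or `"%Y-%m"` for months) changes the aggregation period without touching the rest.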

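I am just starting out with R. I have found plenty of explanations of how to bucket time values, but not durations. I looked at zoo and xts but could not work out how to do it.

Hopefully this makes sense to some of you.

EDIT:

I modified Ram's suggestion slightly so that the number of blocks is computed from the rounded end time rather than the original end time. (Thanks, Ram!)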

mnslot <- 15  # size of the buckets/slots in minutes (set to 5 to match the example above)

# Assumes st, en, df and unix2POSIXct() are defined as in Ram's answer.

# Round the minutes of the start time down to a multiple of mnslot
st.str <- strptime(st, "%Y-%m-%d %H:%M:%S")
min_st <- as.numeric(format(st.str, "%M"))
roundedmins <- floor(min_st / mnslot) * mnslot
st.base <- strptime(st, "%Y-%m-%d %H")
rounded_start <- st.base + (roundedmins * 60)

# Round the minutes of the end time down to a multiple of mnslot
en.str <- strptime(en, "%Y-%m-%d %H:%M:%S")
min_en <- as.numeric(format(en.str, "%M"))
roundedmins <- floor(min_en / mnslot) * mnslot
en.base <- strptime(en, "%Y-%m-%d %H")
rounded_end <- en.base + (roundedmins * 60)

# Calculate the number of blocks from the rounded start and end.
# Subtracting date-times gives a difftime whose units are chosen automatically,
# so request seconds explicitly before dividing by the slot length.
span_secs <- as.numeric(difftime(rounded_end, rounded_start, units = "secs"))
numblocks <- floor(span_secs / (mnslot * 60)) + 1

# Create replicated rows, depending on the size of the interval
replicated_cat <- NULL
replicated_count <- NULL
replicated_start <- NULL
for (n in seq_along(numblocks)) {
  for (newrow in seq_len(numblocks[n])) {
    replicated_start <- c(replicated_start, rounded_start[n] + (newrow - 1) * mnslot * 60)
    replicated_cat <- c(replicated_cat, df$catid[n])
    replicated_count <- c(replicated_count, df$count[n])
  }
}

# Change to a readable format
POSIXT <- unix2POSIXct(replicated_start)

newdf <- data.frame(POSIXT, replicated_cat, replicated_count)
names(newdf) <- c("start", "CatId", "Count")
newdf

This produces the output I need. It is a bit slow, though :p
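Much of the slowness likely comes from growing vectors with `c()` inside nested loops. Below is a hedged sketch of a vectorised alternative using `rep()` and `sequence()`. The function name `bucket_intervals` and the UTC time zone are assumptions, not part of the original code. It also treats each event as a half-open interval, so an event ending exactly on a boundary (such as 15:20:00) does not spill into the next bucket; this differs from the loop above and matches the desired output in the question.

```r
# Hypothetical helper: expand each interval into one row per bucket it covers.
bucket_intervals <- function(start, end, catid, count, mins = 5) {
  slot  <- mins * 60
  first <- floor(as.numeric(start) / slot) * slot
  # Half-open interval: an end exactly on a boundary belongs to the previous bucket.
  last  <- pmax(first, (ceiling(as.numeric(end) / slot) - 1) * slot)
  n      <- as.integer(round((last - first) / slot)) + 1L  # buckets per row
  idx    <- rep(seq_along(n), n)                           # row index, once per bucket
  offset <- (sequence(n) - 1) * slot                       # 0, slot, 2 * slot, ...
  data.frame(
    start = as.POSIXct(first[idx] + offset, origin = "1970-01-01", tz = "UTC"),
    CatId = catid[idx],
    Count = count[idx]
  )
}

# The sample log from the question, read as UTC (assumption).
x <- data.frame(
  start = as.POSIXct(c("2012-11-17 15:05:02", "2012-11-17 15:07:13",
                       "2012-11-17 15:11:00", "2012-11-17 15:12:01"), tz = "UTC"),
  end   = as.POSIXct(c("2012-11-17 15:12:52", "2012-11-17 15:17:47",
                       "2012-11-17 15:11:33", "2012-11-17 15:20:00"), tz = "UTC"),
  catid = 1:4,
  count = c(2, 10, 5, 1)
)
b <- bucket_intervals(x$start, x$end, x$catid, x$count)
b
```

Because every step works on whole vectors, the cost grows with the number of output rows rather than with repeated reallocation.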

score 2 · Accepted Answer

Here is a fully working version. It walks through the data manipulation for what you are asking, step by step.

# The original data is stored in a CSV file
df <- read.csv("tsdata.csv")
st <- as.POSIXlt(df$start)
en <- as.POSIXlt(df$end)

# A utility function that turns seconds since the epoch back into POSIXct
unix2POSIXct <- function(time) structure(time, class = c("POSIXt", "POSIXct"))

# For each row, determine how many replications are needed.
# Request minutes explicitly: a bare en - st picks its units automatically.
numdups <- floor(as.numeric(difftime(en, st, units = "mins")) / 5) + 1

st.str <- strptime(st, "%Y-%m-%d %H:%M:%S")
min_st <- as.numeric(format(st.str, "%M"))

# Round the start minutes down to a multiple of 5: 0, 5, 10, etc.
roundedmins <- floor(min_st / 5) * 5
st.base <- strptime(st, "%Y-%m-%d %H")
df$rounded_start <- st.base + (roundedmins * 60)

# Create replicated rows, depending on the size of the interval
replicated_cat <- NULL
replicated_count <- NULL
replicated_start <- NULL
for (n in seq_along(numdups)) {
  for (newrow in seq_len(numdups[n])) {
    replicated_start <- c(replicated_start, df$rounded_start[n] + (newrow - 1) * 300)
    replicated_cat <- c(replicated_cat, df$catid[n])
    replicated_count <- c(replicated_count, df$count[n])
  }
}

# Change to a readable format
POSIXT <- unix2POSIXct(replicated_start)

newdf <- data.frame(POSIXT, replicated_cat, replicated_count)
names(newdf) <- c("start", "CatId", "Count")
newdf

Which produces:

                start CatId Count
1 2012-11-17 15:05:00     1     2
2 2012-11-17 15:10:00     1     2
3 2012-11-17 15:05:00     2    10
4 2012-11-17 15:10:00     2    10
5 2012-11-17 15:15:00     2    10
6 2012-11-17 15:10:00     3     5
7 2012-11-17 15:10:00     4     1
8 2012-11-17 15:15:00     4     1

score 0 · Accepted Answer

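That is not a simple thing... I am also missing the overall structure of the problem, so I hope it is fine if I limit myself to outlining a basic approach. If anything is unclear, get back to me. First (if I were you) I would install the 'lubridate' package, which makes fiddling with dates and times easier. Then try something like this: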

z <- strptime("17/11/12 15:05:00.0", "%d/%m/%y %H:%M:%OS")

This defines your starting point. If it is supposed to be defined by the first time in the log (x), the minute() function is available:

library(lubridate)
z <- strptime("17/11/12 15:05:02.0", "%d/%m/%y %H:%M:%OS")
minute(z) <- 5; second(z) <- 0.0  # I guess you get the concept

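Then generate a sequence of 5-minute intervals:

z5s <- z + minutes(seq(0, 100, 5))

This generates a sequence of 21 time points, i.e. 20 five-minute intervals. Again, I do not know how flexible the whole thing needs to be.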
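For comparison, a similar grid can be built in base R with `seq()`, and `findInterval()` can then look up which interval a given log time falls into. This is only a sketch: the UTC time zone and the names `log_time` and `i` are assumptions for illustration.

```r
# Grid of 5-minute boundaries starting at 15:05 (UTC is an assumption).
z   <- as.POSIXct("2012-11-17 15:05:00", tz = "UTC")
z5s <- seq(z, by = "5 min", length.out = 21)  # 21 boundaries = 20 intervals

# A hypothetical log time to place on the grid.
log_time <- as.POSIXct("2012-11-17 15:12:52", tz = "UTC")

# findInterval() returns the index of the last boundary at or before log_time.
i <- findInterval(as.numeric(log_time), as.numeric(z5s))
z5s[i]
```

Here `z5s[i]` is the 15:10 boundary, the start of the interval that contains 15:12:52.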

Finally, you can play around with, for example, modulo arithmetic:

z2 <- z + minutes(2)

z2 should be your end time; here I just added 2 minutes "by hand" to illustrate the concept.

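as.numeric(difftime(z2, z, units = "mins")) > 5
FALSE

Or, if you want to know how many whole 5-minute spans are covered, use as.numeric(difftime(z2, z, units = "mins")) %/% 5 (integer division; %% would give the remainder instead), or whatever other function you prefer for matching or distributing your log times across the z5s POSIXlt intervals.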
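One caveat worth illustrating: the duration alone does not determine how many buckets an event touches, because that also depends on where the endpoints fall relative to the boundaries. The sketch below uses a hypothetical 2-minute event (times, UTC time zone, and variable names are assumptions) that still straddles a 5-minute boundary.

```r
slot <- 5 * 60
s <- as.POSIXct("2012-11-17 15:04:00", tz = "UTC")  # hypothetical start
e <- as.POSIXct("2012-11-17 15:06:00", tz = "UTC")  # hypothetical end

dur_mins    <- as.numeric(difftime(e, s, units = "mins"))
whole_spans <- dur_mins %/% 5  # complete 5-minute spans inside the duration
remainder   <- dur_mins %% 5   # minutes left over after those spans

# Buckets touched: compare the bucket index of each endpoint.
touched <- floor(as.numeric(e) / slot) - floor(as.numeric(s) / slot) + 1
c(whole_spans = whole_spans, remainder = remainder, touched = touched)
```

The event contains no complete 5-minute span, yet it touches two buckets (15:00 and 15:05), which is why rounding both endpoints is safer than dividing the duration.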

I hope this helps a bit, i.e. points you in a direction.
