r - Group values and find the upper/lower limits of the range when values are separated by fewer than 3 positions

Question

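My data consists of four columns: date, low, high, and position.

I am trying to summarize the data into groups based on the position field and find the range of each group.

If diff(position) < 3, group the data and apply the range function to each group.
If diff(position) >= 3, compute the range of only the current point and the previous point.

Example: the first 15 values of the fourth field (position) of the data: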

c(12,14,17,18,19,20,21,22,24,28,33,36,37,38,43)

The expected result is to group these as (12,14), then (17:24), (24,28), (28,33), (33,36), (36:38), and finally (38,43), and then find the range of each group.
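To make the rule concrete, the gaps between consecutive positions can be inspected directly. This is only an illustrative sketch; the names pos, gaps, and wide are made up for the example and do not come from the real data.

```r
# The example positions from above
pos <- c(12,14,17,18,19,20,21,22,24,28,33,36,37,38,43)

# Gap between each position and the one before it
gaps <- diff(pos)
gaps
# 2 3 1 1 1 1 1 2 4 5 3 1 1 5

# Indices of the gaps that are >= 3, i.e. where a run of close values ends
wide <- which(gaps >= 3)
wide
# 2 9 10 11 14
```

Every gap listed in wide should become its own two-point range, and the stretches between them should be grouped together.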

score 1 · Accepted Answer

diff is one option you can use to identify the boundaries between groups.

groupBy <- function(dat, thresh=3)  {
    # bounds will grab the *END* of every group (except last element)
    bounds <- which(diff(dat) >= thresh)

    # add the last index of dat to the "stops" indices
    stops  <- c(bounds, length(dat))

    # starts are 1 more than the bounds. We also add the first element
    starts <- c(1, bounds+1)

    # mapply to get `seq(starts, stops)`; SIMPLIFY=FALSE keeps the result a
    # list even when every group happens to have the same length
    indices <- mapply(seq, from=starts, to=stops, SIMPLIFY=FALSE)

    # return: lapply over each index to get the results
    lapply(indices, function(i) dat[i])
}

Test:

dat1 <- c(12,14,17,18,19,20,21,22,24,28,33,36,37,38,43)
dat2 <- c(5,6,7,9,13,17,21,35,36,41)

groupBy(dat1)
groupBy(dat2)
groupBy(dat2, 5)
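To go from the groups to the ranges the question asks about, range can be applied to each group. The sketch below restates groupBy compactly so that it runs on its own; the variable names groups and ranges are made up for the example.

```r
# Compact restatement of groupBy(), repeated so this snippet is self-contained
groupBy <- function(dat, thresh = 3) {
    bounds <- which(diff(dat) >= thresh)   # last index of every group but the final one
    starts <- c(1, bounds + 1)
    stops  <- c(bounds, length(dat))
    Map(function(from, to) dat[from:to], starts, stops)
}

dat1 <- c(12,14,17,18,19,20,21,22,24,28,33,36,37,38,43)
groups <- groupBy(dat1)

# One row per group: lower and upper end of the range
ranges <- t(sapply(groups, range))
colnames(ranges) <- c("lower", "upper")
ranges
```

Note that isolated values such as 28, 33, and 43 end up as single-element groups here, so their lower and upper ends are identical. Pairing them with the previous point, as the question describes, would need an extra step.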

score 1 · Accepted Answer

Using IRanges:

library(IRanges)  # Bioconductor package
x <- c(12,14,17,18,19,20,21,22,24,28,33,36,37,38,43)
o <- reduce(IRanges(x, width=1), min.gapwidth=2)

This gives:

# IRanges of length 6
#     start end width
# [1]    12  14     3
# [2]    17  24     8
# [3]    28  28     1
# [4]    33  33     1
# [5]    36  38     3
# [6]    43  43     1

This solves half the problem. Wherever width = 1, we want the start to take the end value of the previous range. So let's convert this to a data.frame.

o <- as.data.frame(o)
o$start[o$width == 1] <- o$end[which(o$width == 1)-1]
o$width <- NULL

#   start end
# 1    12  14
# 2    17  24
# 3    24  28
# 4    28  33
# 5    36  38
# 6    38  43

This gives the final result.

Edit: It seems the OP missed (14,17) in the desired ranges.

ir <- IRanges(x, width = 1)
o1 <- reduce(ir, min.gapwidth = 2)
o2 <- gaps(o1)
start(o2) <- start(o2) - 1
end(o2) <- end(o2) + 1
o1 <- as.data.frame(o1[width(o1) > 1])
o2 <- as.data.frame(o2)
out <- rbind(o1, o2)
out <- out[with(out, order(start, end)), ]

#   start end width
# 1    12  14     3
# 4    14  17     4
# 2    17  24     8
# 5    24  28     5
# 6    28  33     6
# 7    33  36     4
# 3    36  38     3
# 8    38  43     6

score 1 · Accepted Answer

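Here is a function that uses base R functions to return a list of position indices grouped according to the stated rules. If the values may not be monotonic and you only care about absolute differences, I think it would be enough to change diff(x) to abs(diff(x)) (and remove the subsequent monotonicity check).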

groupIndexes <- function(x, gap=3) {
    d <- diff(x)
    # currently assuming x is sorted in non-decreasing order (ties are allowed)
    if (any(d<0)) stop("x must be non-decreasing")
    is.near <- (d < gap)
    # catch case of a single group
    if (all(is.near)) return(list(seq_along(x)))
    runs <- rle(ifelse(is.near, 0, seq_along(is.near)))
    gr <- rep(seq.int(runs$lengths), times=runs$lengths)
    lapply(unique(gr), function(i) {
        ind <- if(runs$values[i]>0) {
            match(i, gr)
        } else {
            which(gr==i)
        }
        c(ind, max(ind)+1)
    })
}

This produces the grouped values themselves:

x <- c(12,14,17,18,19,20,21,22,24,28,33,36,37,38,43)
lapply(groupIndexes(x), function(ind) x[ind])
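For the non-monotonic case mentioned earlier, a sketch of the abs(diff(x)) variant might look like the following. The function name groupIndexesAbs and the test vector y are made up for illustration, and the behaviour has only been reasoned through on this small example.

```r
# Hypothetical variant of groupIndexes() for input that is not sorted:
# it uses abs(diff(x)) and drops the ordering check.
groupIndexesAbs <- function(x, gap = 3) {
    is.near <- abs(diff(x)) < gap
    # catch the case of a single group
    if (all(is.near)) return(list(seq_along(x)))
    # runs of "near" steps share the value 0; each wide step gets a unique value
    runs <- rle(ifelse(is.near, 0, seq_along(is.near)))
    gr <- rep(seq_along(runs$lengths), times = runs$lengths)
    lapply(unique(gr), function(i) {
        ind <- if (runs$values[i] > 0) match(i, gr) else which(gr == i)
        c(ind, max(ind) + 1)
    })
}

y <- c(5, 3, 4, 10, 8)   # made-up, non-monotonic positions
res <- lapply(groupIndexesAbs(y), function(ind) y[ind])
res
# groups: (5, 3, 4), (4, 10), (10, 8)
```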

In your actual case, if you have a data frame "dat", you can generate groups based on the "position" column and compute the per-group range of the "low" column like this:

lapply(groupIndexes(dat$position), function(ind) range(dat$low[ind]))
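As a fuller sketch, the per-group results can be collected into one summary table. The data frame below is made up, and taking the minimum of low together with the maximum of high is an assumption about what "range" should mean for the real data. groupIndexes is repeated in compact form so the snippet runs on its own.

```r
# Compact copy of groupIndexes() from above
groupIndexes <- function(x, gap = 3) {
    d <- diff(x)
    if (any(d < 0)) stop("x must be non-decreasing")
    is.near <- (d < gap)
    if (all(is.near)) return(list(seq_along(x)))
    runs <- rle(ifelse(is.near, 0, seq_along(is.near)))
    gr <- rep(seq_along(runs$lengths), times = runs$lengths)
    lapply(unique(gr), function(i) {
        ind <- if (runs$values[i] > 0) match(i, gr) else which(gr == i)
        c(ind, max(ind) + 1)
    })
}

# Made-up example data; the date column is omitted for brevity
dat <- data.frame(
    position = c(12, 14, 17, 18, 19, 24, 28),
    low      = c(1.0, 0.8, 1.2, 1.1, 0.9, 1.5, 1.3),
    high     = c(1.4, 1.1, 1.6, 1.5, 1.2, 1.9, 1.8)
)

# One row per group: first/last position, lowest low, highest high
out <- do.call(rbind, lapply(groupIndexes(dat$position), function(ind) {
    data.frame(from = dat$position[min(ind)],
               to   = dat$position[max(ind)],
               low  = min(dat$low[ind]),
               high = max(dat$high[ind]))
}))
out
```

With this input the groups span positions 12 to 14, 14 to 17, 17 to 19, 19 to 24, and 24 to 28.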
