r - Median equivalent of "which.max" and "which.min" / Extracting median rows from a data.frame

Question

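I sometimes need to extract specific rows from a data.frame based on the values of one of its variables. R has built-in functions for the maximum (which.max()) and the minimum (which.min()) that make it easy to extract those rows.

Is there an equivalent for the median? Or is my best bet to write my own function?

Here is an example data.frame, along with how I would use which.max() and which.min():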

set.seed(1) # so you can reproduce this example
dat = data.frame(V1 = 1:10, V2 = rnorm(10), V3 = rnorm(10), 
                 V4 = sample(1:20, 10, replace = TRUE))

# To return the first row, which contains the max value in V4
dat[which.max(dat$V4), ]
# To return the seventh row, which contains the min value in V4
dat[which.min(dat$V4), ]

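In this particular example, because the number of observations is even, two rows would need to be returned (in this case, rows 2 and 10).

Update

There does not seem to be a built-in function for this. So, using Sacha's reply as a starting point, I wrote the following function: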

which.median = function(x) {
  if (length(x) %% 2 != 0) {
    which(x == median(x))
  } else if (length(x) %% 2 == 0) {
    a = sort(x)[c(length(x)/2, length(x)/2+1)]
    c(which(x == a[1]), which(x == a[2]))
  }
}

It can be used as follows:

# make one data.frame with an odd number of rows
dat2 = dat[-10, ]
# Median rows from 'dat' (even number of rows) and 'dat2' (odd number of rows)
dat[which.median(dat$V4), ]
dat2[which.median(dat2$V4), ]
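
One edge case worth checking in which.median() as written: when the two middle values are equal, both which() calls match the same positions, so each index is returned twice. Below is a minimal sketch of a variant that returns each position once, assuming that is the desired behaviour. The name which.median.unique is hypothetical, and positions come back in increasing order rather than grouped by value.

```r
# Hypothetical variant of which.median() that lists tied positions only once
which.median.unique = function(x) {
  n = length(x)
  if (n %% 2 != 0) {
    which(x == median(x))
  } else {
    # the two middle order statistics
    a = sort(x)[c(n / 2, n / 2 + 1)]
    # %in% matches either middle value, so each position appears once
    which(x %in% a)
  }
}

x = c(1, 2, 2, 3)
which.median.unique(x)  # positions 2 and 3, each listed once
```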

Are there any suggestions for improving this?

score 15 · Accepted Answer

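I think Sacha's solution is quite general, but the median (like any other quantile) is an order statistic, so you can compute the corresponding indices from order (x) (instead of using sort (x) to get the quantile values).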
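
To make the relationship between order() and sort() concrete, here is a small illustration with made-up numbers. Indexing x by order(x) reproduces sort(x), so the k-th entry of order(x) is the position in x of the k-th smallest value.

```r
x <- c(40, 10, 30, 20, 50)   # made-up values for illustration

o <- order(x)                # positions of x, arranged by increasing value
identical(x[o], sort(x))     # TRUE: order() encodes the same ordering as sort()

k <- (length(x) + 1) / 2     # rank of the median for an odd-length vector
o[k]                         # position of the median in x
x[o[k]] == median(x)         # TRUE
```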

Looking into quantile, types 1 or 3 can be used. All the other types lead to a (weighted) average of two values in certain cases.
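
A quick illustration of that difference, using a made-up vector of even length: types 1 and 3 return a value that actually occurs in the data, whereas the default type 7 interpolates between the two middle values.

```r
x <- c(1, 2, 3, 4)

q1 <- quantile(x, 0.5, type = 1, names = FALSE)
q3 <- quantile(x, 0.5, type = 3, names = FALSE)
q7 <- quantile(x, 0.5, type = 7, names = FALSE)  # type 7 is the default

c(q1, q3) %in% x   # both are observed values
q7 %in% x          # FALSE: 2.5 is the average of 2 and 3
```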

I chose type 3, and a bit of copy & paste from quantile leads to:

which.quantile <- function (x, probs, na.rm = FALSE){
  if (! na.rm & any (is.na (x)))
    return (rep (NA_integer_, length (probs)))

  o <- order (x)
  n <- sum (! is.na (x))
  o <- o [seq_len (n)]

  nppm <- n * probs - 0.5
  j <- floor(nppm)
  h <- ifelse((nppm == j) & ((j%%2L) == 0L), 0, 1)
  j <- j + h

  j [j == 0] <- 1
  o[j]
}

A little test:

> x <-c (2.34, 5.83, NA, 9.34, 8.53, 6.42, NA, 8.07, NA, 0.77)
> probs <- c (0, .23, .5, .6, 1)
> which.quantile (x, probs, na.rm = TRUE)
[1] 10  1  6  6  4
> x [which.quantile (x, probs, na.rm = TRUE)] == quantile (x, probs, na.rm = TRUE, type = 3)

  0%  23%  50%  60% 100% 
TRUE TRUE TRUE TRUE TRUE

Here is your example:

> dat [which.quantile (dat$V4, c (0, .5, 1)),]
  V1         V2          V3 V4
7  7  0.4874291 -0.01619026  1
2  2  0.1836433  0.38984324 13
1  1 -0.6264538  1.51178117 17

score 9 · Accepted Answer

I think just:

which(dat$V4 == median(dat$V4))

But be careful: if there is no single middle number, the median is the average of the two middle numbers. For example, median(1:4) returns 2.5, which does not match any of the elements.
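
The pitfall is easy to reproduce. In this small sketch with made-up vectors, the comparison finds the median of an odd-length vector but matches nothing in an even-length one:

```r
odd  <- c(3, 1, 2)
even <- 1:4

which(odd == median(odd))     # 3: the value 2 sits at position 3
which(even == median(even))   # integer(0): no element equals 2.5
```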

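Edit

Here is a function that returns the index of the element equal to the median or, when the median is the average of two middle values, the first element closest to that average. This is similar to how which.min() returns only the first element equal to the minimum: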

whichmedian <- function(x) which.min(abs(x - median(x)))

For example:

> whichmedian(1:4)
[1] 2
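
If every element closest to the median is wanted rather than only the first, one possible variant is sketched below. The name whichmedian.all is hypothetical, and the sketch assumes x contains no NA values.

```r
# Hypothetical variant: return all positions closest to the median
whichmedian.all <- function(x) {
  d <- abs(x - median(x))   # distance of each element from the median
  which(d == min(d))        # every position at the smallest distance
}

whichmedian.all(1:4)          # 2 3: both middle elements
whichmedian.all(c(5, 1, 3))   # 3: the single median element
```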

score 2 · Accepted Answer

I have written a more comprehensive function that serves my needs:

row.extractor = function(data, extract.by, what) {
# data = your data.frame
# extract.by = the variable that you are extracting by, either
#              as its index number or by name
# what = either "min", "max", "median", or "all", with quotes
  if (!is.numeric(extract.by)) {
    # a column name was given: translate it into its index
    extract.by = which(colnames(data) %in% extract.by)
  }
  which.median = function(data, extract.by) {
    a = data[, extract.by]
    if (length(a) %% 2 != 0) {
      which(a == median(a))
    } else if (length(a) %% 2 == 0) {
      b = sort(a)[c(length(a)/2, length(a)/2+1)]
      c(max(which(a == b[1])), min(which(a == b[2])))
    }
  }
  X1 = data[which(data[extract.by] == min(data[extract.by])), ] 
  X2 = data[which(data[extract.by] == max(data[extract.by])), ]
  X3 = data[which.median(data, extract.by), ]
  if (what == "min") {
    X1
  } else if (what == "max") {
    X2
  } else if (what == "median") {
    X3
  } else if (what == "all") {
    rbind(X1, X3, X2)
  }
}

Example usage:

> row.extractor(dat, "V4", "max")
  V1         V2       V3 V4
1  1 -0.6264538 1.511781 17
> row.extractor(dat, 4, "min")
  V1        V2          V3 V4
7  7 0.4874291 -0.01619026  1
> row.extractor(dat, "V4", "all")
   V1         V2          V3 V4
7   7  0.4874291 -0.01619026  1
2   2  0.1836433  0.38984324 13
10 10 -0.3053884  0.59390132 14
4   1 -0.6264538  1.51178117 17

score 2 · Accepted Answer

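Suppose the vector whose median you want is x.

The expression which.min(x[x>=median(x)]) finds the median if length(x)=2*n+1, or the larger of the two middle values if length(x)=2*n, as a position within x[x>=median(x)]. It can be tweaked slightly if you want the smaller of the two middle values instead.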
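
For row extraction, the position has to refer to x itself, whereas which.min() here is applied to the filtered vector. A minimal sketch with made-up numbers, showing one way to get positions in x for both the larger and the smaller middle value:

```r
x <- c(7, 1, 5, 3)                      # even length; middle values are 3 and 5

which.min(x[x >= median(x)])            # 2: a position within c(7, 5), not within x

# positions in x of the larger and the smaller middle value
upper <- which(x == min(x[x >= median(x)]))
lower <- which(x == max(x[x <= median(x)]))
upper                                   # 3, since x[3] is 5
lower                                   # 4, since x[4] is 3
```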
