r - グループごとに上位N個の値を選択します

Question

を使用してグループごとに上位の値を見つける方法の例が たくさんあるので、Rパッケージsqlを使用するよりもその知識を変換するのは簡単だと思います。sqldf

例：mtcarsがでグループ化されcylている場合、の個別の値ごとの上位3つのレコードを次に示しますcyl。この場合、ネクタイは除外されますが、ネクタイを処理するいくつかの異なる方法を示すとよいでしょう。

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb ranks
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1   2.0
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2   1.0
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1   2.0
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4   3.0
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4   1.0
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4   1.5
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4   1.5
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4   3.0

グループごとに上位または下位（最大または最小）のNレコードを見つける方法は？

score 55 · Accepted Answer

data.tableこれは、キーを設定しながら並べ替えを実行するため、より簡単に使用できます。

したがって、ソート (昇順) で上位 3 つのレコードを取得する場合、次のようになります。

require(data.table)
d <- data.table(mtcars, key="cyl")
d[, head(.SD, 3), by=cyl]

それをします。

そして、降順が必要な場合

d[, tail(.SD, 3), by=cyl] # Thanks @MatthewDowle

編集：列を使用して同点を並べ替えるには：mpg

d <- data.table(mtcars, key="cyl")
d.out <- d[, .SD[mpg %in% head(sort(unique(mpg)), 3)], by=cyl]

#     cyl  mpg  disp  hp drat    wt  qsec vs am gear carb rank
#  1:   4 22.8 108.0  93 3.85 2.320 18.61  1  1    4    1   11
#  2:   4 22.8 140.8  95 3.92 3.150 22.90  1  0    4    2    1
#  3:   4 21.5 120.1  97 3.70 2.465 20.01  1  0    3    1    8
#  4:   4 21.4 121.0 109 4.11 2.780 18.60  1  1    4    2    6
#  5:   6 18.1 225.0 105 2.76 3.460 20.22  1  0    3    1    7
#  6:   6 19.2 167.6 123 3.92 3.440 18.30  1  0    4    4    1
#  7:   6 17.8 167.6 123 3.92 3.440 18.90  1  0    4    4    2
#  8:   8 14.3 360.0 245 3.21 3.570 15.84  0  0    3    4    7
#  9:   8 10.4 472.0 205 2.93 5.250 17.98  0  0    3    4   14
# 10:   8 10.4 460.0 215 3.00 5.424 17.82  0  0    3    4    5
# 11:   8 13.3 350.0 245 3.73 3.840 15.41  0  0    3    4    3

# and for last N elements, of course it is straightforward
d.out <- d[, .SD[mpg %in% tail(sort(unique(mpg)), 3)], by=cyl]

score 25 · Accepted Answer

dplyrトリックを行います

mtcars %>% 
arrange(desc(mpg)) %>% 
group_by(cyl) %>% slice(1:2)


 mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
2  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
3  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
4  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
5  19.2     8 400.0   175  3.08 3.845 17.05     0     0     3     2
6  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2

score 22 · Accepted Answer

なんでも並べ替えるだけです（たとえば、mpg、これについての質問は明確ではありません）

mt <- mtcars[order(mtcars$mpg), ]

次に by 関数を使用して、各グループの上位 n 行を取得します

d <- by(mt, mt["cyl"], head, n=4)

結果を data.frame にしたい場合:

Reduce(rbind, d)

編集： 同点の処理はより困難ですが、すべての同点が必要な場合：

by(mt, mt["cyl"], function(x) x[rank(x$mpg) %in% sort(unique(rank(x$mpg)))[1:4], ])

別のアプローチは、他の情報に基づいて関係を断ち切ることです。

mt <- mtcars[order(mtcars$mpg, mtcars$hp), ]
by(mt, mt["cyl"], head, n=4)

score 9 · Accepted Answer

これを行うには少なくとも 4 つの方法がありますが、それぞれに多少の違いがあります。u_id を使用してグループ化し、リフト値を使用して順序付け/並べ替えを行います

1 dplyr 従来の方法

library(dplyr)
top10_final_subset1 = final_subset %>% arrange(desc(lift)) %>% group_by(u_id) %>% slice(1:10)

そして、arrange(desc(lift)) と group_by(u_id) の順序を入れ替えると、結果は本質的に同じになります。同じリフト値で同点の場合は、各グループの値が 10 以下であることを確認するためにスライスします。、グループに 5 つのリフト値しかない場合、そのグループの結果は 5 つしか得られません。

2層トップNウェイ

library(dplyr)
top10_final_subset2 = final_subset %>% group_by(u_id) %>% top_n(10,lift)

同じ u_id に対して 15 個の同じリフトがあるとします。15 個すべての観測結果が得られます。

3 data.table テールウェイ

library(data.table)
final_subset = data.table(final_subset,key = "lift")
top10_final_subset3 = final_subset[,tail(.SD,10),,by = c("u_id")]

それは最初の方法と同じ行番号を持っていますが、いくつかの行が異なります.tieを扱うdiffランダムアルゴリズムを使用していると思います。

4 data.table.SDのやり方

library(data.table)
top10_final_subset4 = final_subset[,.SD[order(lift,decreasing = TRUE),][1:10],by = "u_id"]

この方法は最も「均一な」方法です。グループ内の観測値が 5 つしかない場合は、値を繰り返して 10 個の観測値にします。同点の場合は、スライスして 10 個の観測値に対してのみ保持します。

score 4 · Accepted Answer

mtcars$mpg の 4 番目の位置に同点があった場合、これはすべての同点を返す必要があります。

top_mpg <- mtcars[ mtcars$mpg >= mtcars$mpg[order(mtcars$mpg, decreasing=TRUE)][4] , ]

> top_mpg
                mpg cyl disp  hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2

3-4 の位置に同点があるため、4 を 3 に変更してテストできますが、それでも 4 項目が返されます。これは論理インデックス付けであり、NA を削除する句を追加するか、論理式を which() でラップする必要がある場合があります。これを「by」cylで行うのはそれほど難しくありません：

 Reduce(rbind,  by(mtcars, mtcars$cyl, 
        function(d) d[ d$mpg >= d$mpg[order(d$mpg, decreasing=TRUE)][4] , ]) )
#-------------
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Fiat 128          32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic       30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla    33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Lotus Europa      30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Ferrari Dino      19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Merc 450SE        16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL        17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Pontiac Firebird  19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2

私の提案を@Istaに組み込む：

Reduce(rbind,  by(mtcars, mtcars$cyl, function(d) d[ d$mpg <= sort( d$mpg )[3] , ]) )

score 3 · Accepted Answer

データベースを因子で分割し、別の目的の変数で並べ替え、各因子 (カテゴリ) で必要な行数を抽出し、これらをデータベースに結合する関数を作成できます。

top<-function(x, num, c1,c2){
sorted<-x[with(x,order(x[,c1],x[,c2],decreasing=T)),]
splits<-split(sorted,sorted[,c1])
df<-lapply(splits,head,num)
do.call(rbind.data.frame,df)}

xはデータフレームです。

numは、表示する行数です。

c1は、分割する変数の列番号です。

c2は、ランク付けしたい、または同点を処理したい変数の列番号です。

関数は mtcars データを使用して、各シリンダークラス (mtcars$cyl は2番目の列) で最も重い3台の車 (mtcars$wt は6番目の列) を抽出します。

 top(mtcars,3,2,6)
                         mpg cyl  disp  hp drat    wt  qsec vs am gear carb
 4.Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
 4.Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
 4.Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
 6.Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
 6.Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
 6.Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
 8.Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
 8.Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
 8.Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4

また、lapply 関数の head をtailに変更するか、 order関数の減少 = T 引数を削除してデフォルトの減少 = F に戻すことで、クラスで最も軽いものを簡単に取得できます。

score 0 · Accepted Answer

# start with the mtcars data frame (included with your installation of R)
mtcars

# pick your 'group by' variable
gbv <- 'cyl'
# IMPORTANT NOTE: you can only include one group by variable here
# ..if you need more, the `order` function below will need
# one per inputted parameter: order( x$cyl , x$am )

# choose whether you want to find the minimum or maximum
find.maximum <- FALSE

# create a simple data frame with only two columns
x <- mtcars

# order it based on 
x <- x[ order( x[ , gbv ] , decreasing = find.maximum ) , ]

# figure out the ranks of each miles-per-gallon, within cyl columns
if ( find.maximum ){
    # note the negative sign (which changes the order of mpg)
    # *and* the `rev` function, which flips the order of the `tapply` result
    x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank ) ) )
} else {
    x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank ) )
}
# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]

# look at your results
result

# done!

# but note only *two* values where cyl == 4 were kept,
# because there was a tie for third smallest, and the `rank` function gave both '3.5'
x[ x$ranks == 3.5 , ]

# ..if you instead wanted to keep all ties, you could change the
# tie-breaking behavior of the `rank` function.
# using the `min` *includes* all ties.  using `max` would *exclude* all ties
if ( find.maximum ){
    # note the negative sign (which changes the order of mpg)
    # *and* the `rev` function, which flips the order of the `tapply` result
    x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) ) )
} else {
    x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) )
}
# and there are even more options..
# see ?rank for more methods

# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]

# look at your results
result
# and notice *both* cyl == 4 and ranks == 3 were included in your results
# because of the tie-breaking behavior chosen.

r - グループごとに上位N個の値を選択します

10 に答える 10

Related

Reference