r - Select random rows by category from a data frame?

Question

I have a data frame like this:

Category Name Value

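How can I select 5 random names per category? Using sample() returns random rows with every row of the data frame as a possible candidate. I want to specify the number of random rows to draw per category instead. Any advice?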

Update: I'm open to using ddply.

score 7 · Accepted Answer

Best guess in the absence of a test case:

  do.call( rbind, lapply( split(dfrm, dfrm$cat) ,
                         function(df) df[sample(nrow(df), 5) , ] )
          )

Tested with Jonathan's data:

> do.call( rbind, lapply( split(df, df$Category) ,
+                          function(df) df[sample(nrow(df), 5) , ] )
+           )

      Category Name      Value   
1.8          1    8 -0.2496109   #  useful side-effect of labeling source group
1.15         1   15 -0.4037368
1.17         1   17 -0.4223724
1.12         1   12 -0.9359026
1.18         1   18  0.3741184
2.37         2   37  0.3033610
2.34         2   34 -0.4517738
2.36         2   36 -0.7695923
snipped remainder
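
A quick way to confirm that every category contributed five rows, sketched with generated data shaped like the test data in this thread (the object name `res` and the seed are assumptions added for illustration):

```r
set.seed(42)  # only so the sketch is reproducible
df <- data.frame(Category = rep(1:5, each = 20), Name = 1:100, Value = rnorm(100))

# Split by category, sample 5 rows from each piece, then stack the pieces
res <- do.call(rbind, lapply(split(df, df$Category),
                             function(d) d[sample(nrow(d), 5), ]))

table(res$Category)    # number of sampled rows per category
rownames(res) <- NULL  # drop the "group.row" labels if they are not wanted
```

Note that sample(nrow(d), 5) stops with an error if a category has fewer than five rows.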

score 4 · Accepted Answer

If you want the same number of items from each category, this is easy:

df[unlist(tapply(1:nrow(df),df$Category,function(x) sample(x,3))),]

For example, I generated df as follows:

df <- data.frame(Category=rep(1:5,each=20),Name=1:100,Value=rnorm(100))

The code then gives:

> df[unlist(tapply(1:nrow(df),df$Category,function(x) sample(x,3))),]
    Category Name       Value
5          1    5  0.25151044
20         1   20  1.52486482
18         1   18  0.69313462
30         2   30  0.73444185
27         2   27  0.24000427
39         2   39 -0.10108203
46         3   46 -0.37200574
49         3   49 -1.84920469
43         3   43  0.35976388
68         4   68  0.57879516
76         4   76 -0.11049302
64         4   64 -0.13471303
100        5  100  0.95979408
95         5   95 -0.01928741
99         5   99  0.85725242

If you want a different number of rows from each category, it gets more complicated.
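
One possible way to handle that case in base R, as a minimal sketch: split the row indices by category and sample each group with its own size. The named vector `n_per_cat` and its values are hypothetical, chosen only for illustration.

```r
set.seed(1)  # only so the sketch is reproducible
df <- data.frame(Category = rep(1:5, each = 20), Name = 1:100, Value = rnorm(100))

# Hypothetical per-category sample sizes, named by category value
n_per_cat <- c("1" = 2, "2" = 3, "3" = 4, "4" = 5, "5" = 2)

# Row indices grouped by category; the list names are the category values
idx <- split(seq_len(nrow(df)), df$Category)

# Sample n indices from each group. Indexing with sample.int() avoids the
# surprise that sample(x, n) treats a single number x as the range 1:x.
picked <- Map(function(x, n) x[sample.int(length(x), n)],
              idx, n_per_cat[names(idx)])

res <- df[unlist(picked, use.names = FALSE), ]
table(res$Category)
```

Looking the sizes up by name (n_per_cat[names(idx)]) keeps each size matched to its category even if the vector is written in a different order.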

score 3 · Accepted Answer

In the past, I've used a small wrapper that I wrote around some of the functions in the "sampling" package.

Here is the function:

strata.sampling <- function(data, group, size, method = NULL) {
  #  USE: 
  #   * Specify a data.frame and grouping variable.
  #   * Decide on your sample size. For a sample proportional to the 
  #     population, enter "size" as a decimal. For an equal number of 
  #     samples from each group, enter "size" as a whole number. For
  #     a specific number of samples from each group, enter the numbers
  #     required as a vector.

  require(sampling)
  if (is.null(method)) method <- "srswor"
  if (!method %in% c("srswor", "srswr")) 
    stop('method must be "srswor" or "srswr"')
  temp <- data[order(data[[group]]), ]
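  # A vector of per-group sizes is used as given. A single number is either
  # a proportion (< 1) or an equal count for every group.
  if (length(size) == 1) {
    counts <- table(temp[group])
    size <- if (size < 1) round(counts * size) else rep(size, times = length(counts))
  }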
  strat = strata(temp, stratanames = names(temp[group]), 
                 size = size, method = method)
  getdata(temp, strat)
}

Here is how to use it:

# Sample data --- Note each category has a different number of observations
df <- data.frame(Category = rep(1:5, times = c(40, 15, 7, 13, 25)), 
                 Name = 1:100, Value = rnorm(100))

# Sample 5 from each "Category" group
strata.sampling(df, "Category", 5)
# Sample 2 from the first category, 3 from the next, and so on
strata.sampling(df, "Category", c(2, 3, 4, 5, 2))
# Sample 15% from each group
strata.sampling(df, "Category", .15)

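There is also an extended version that I wrote here. That function gracefully handles cases where a group may have fewer observations than the requested number of samples, and it can also stratify by multiple variables. See the documentation for some examples.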
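
The extended function itself is not reproduced here. As a rough base R illustration of the first point only (not the author's implementation), the requested size can be capped at the group size. The helper name `sample_up_to` is hypothetical.

```r
# Hypothetical helper: sample up to n rows per group, never more than exist
sample_up_to <- function(data, group, n) {
  idx <- split(seq_len(nrow(data)), data[[group]])
  picked <- lapply(idx, function(x) x[sample.int(length(x), min(n, length(x)))])
  data[unlist(picked, use.names = FALSE), ]
}

set.seed(1)  # only so the sketch is reproducible
df <- data.frame(Category = rep(1:5, times = c(40, 15, 7, 13, 25)),
                 Name = 1:100, Value = rnorm(100))

res <- sample_up_to(df, "Category", 10)
table(res$Category)  # category 3 has only 7 rows, so it contributes all 7
```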
