r - How can I ensure that a partition contains representative observations from each level of a factor?

Question

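I wrote a small function to partition my dataset into training and testing sets. However, I am running into trouble when dealing with factor variables. In the model validation phase of my code, I get an error if the model was built on a dataset that does not have representation from each level of a factor. How can I fix this partition() function so that it includes at least one observation from every level of a factor variable?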

set.seed(123)  # seed first so that test.df itself is reproducible
test.df <- data.frame(a = sample(c(0, 1), 100, replace = TRUE),
                      b = factor(sample(letters, 100, replace = TRUE)),
                      c = factor(sample(c("apple", "orange"), 100, replace = TRUE)))

partition <- function(data, train.size = .7) {
  # Draw row positions at random, with no regard to factor levels
  train.index <- sample(seq_len(nrow(data)), round(train.size * nrow(data)))
  train <- data[train.index, ]
  # Index by position rather than by row name, so non-default row names work too
  test <- data[-train.index, ]
  list(train = train, test = test)
}

part.data <- partition(test.df)
table(part.data$train[, 'b'])
table(part.data$test[, 'b'])
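
To see which levels a given split misses, the sets of levels can be compared directly. The following is a minimal base-R sketch of that check. The helper name missing_levels and the small data frame df are assumptions made for illustration and are not part of the original code.

```r
# Hypothetical helper: levels of a factor column that never occur in a subset
missing_levels <- function(full, part, column) {
  setdiff(levels(full[[column]]), unique(as.character(part[[column]])))
}

set.seed(1)
df <- data.frame(b = factor(sample(letters, 100, replace = TRUE)))

# Plain random 70/30 split, in the same spirit as partition() above
idx <- sample(seq_len(nrow(df)), round(0.7 * nrow(df)))
train <- df[idx, , drop = FALSE]
test <- df[-idx, , drop = FALSE]

missing_levels(df, train, "b")  # levels absent from the training set
missing_levels(df, test, "b")   # levels absent from the test set
```

An empty result for the training set means that a model fitted on it has seen every level that can turn up at validation time.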

Edit - New function using the 'caret' package and createDataPartition():

partition <- function(data, factor = NULL, train.size = .7) {
  if (!requireNamespace("caret", quietly = TRUE)) {
    stop("Install the 'caret' package")
  }
  # With no factor supplied, stratify on the row positions (a numeric vector);
  # otherwise stratify on the factor so that sampling happens within each level.
  y <- if (is.null(factor)) seq_len(nrow(data)) else factor
  train.index <- caret::createDataPartition(y, times = 1, p = train.size,
                                            list = FALSE)
  train <- data[train.index, ]
  test <- data[-train.index, ]
  list(train = train, test = test)
}

score 6 · Accepted Answer

Try the caret package, particularly the function createDataPartition(). It should do exactly what you need. It is available on CRAN, and the homepage is here:

caret - Data Splitting
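
As a rough illustration of that suggestion, a call might look like the sketch below. It assumes that caret is installed (the block does nothing otherwise), and the data frame df is made up for the example. As far as I recall, the caret documentation also cautions that very small classes may still fail to show up on both sides of the split, so checking the resulting tables is worthwhile.

```r
# Sketch only: requires the caret package (install.packages("caret"))
if (requireNamespace("caret", quietly = TRUE)) {
  set.seed(123)
  df <- data.frame(a = sample(c(0, 1), 100, replace = TRUE),
                   c = factor(sample(c("apple", "orange"), 100, replace = TRUE)))

  # Stratify on the factor `c`: rows are sampled within each level
  train.index <- caret::createDataPartition(df$c, p = 0.7, list = FALSE)
  train <- df[train.index, ]
  test <- df[-train.index, ]

  # Inspect how each level is distributed across the two sets
  print(table(train$c))
  print(table(test$c))
}
```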

The stratified() function below is partly code that I found online, which I modified slightly to better handle edge cases (such as asking for a sample size larger than the set or a subset).

stratified <- function(df, group, size) {
  # USE: * Specify your data frame and grouping variable (as column
  # number) as the first two arguments.
  # * Decide on your sample size. For a sample proportional to the
  # population, enter "size" as a decimal. For an equal number
  # of samples from each group, enter "size" as a whole number.
  #
  # Example 1: Sample 10% of each group from a data frame named "z",
  # where the grouping variable is the fourth variable, use:
  #
  # > stratified(z, 4, .1)
  #
  # Example 2: Sample 5 observations from each group from a data frame
  # named "z"; grouping variable is the third variable:
  #
  # > stratified(z, 3, 5)
  #
  library(sampling)  # provides strata() and getdata()
  temp <- df[order(df[group]), ]
  colsToReturn <- ncol(df)

  # Don't attempt to sample more observations than the smallest group holds
  dfCounts <- table(df[group])
  if (size > min(dfCounts)) {
    size <- min(dfCounts)
  }

  if (size < 1) {
    # Proportional sampling: round up so that every group contributes a row
    size <- ceiling(table(temp[group]) * size)
  } else {
    # Fixed-size sampling: the same number of rows from every group
    size <- rep(size, times = length(table(temp[group])))
  }
  strat <- strata(temp, stratanames = names(temp[group]),
                  size = size, method = "srswor")
  dsample <- getdata(temp, strat)

  dsample <- dsample[order(dsample[1]),]
  dsample <- data.frame(dsample[,1:colsToReturn], row.names=NULL)
  return(dsample)

}
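
For comparison, the same idea can be sketched in base R without extra packages. This is only an illustration under stated assumptions: the function name stratified_partition and its arguments are hypothetical, and it belongs to neither caret nor sampling. It samples rows separately within each level and always keeps at least one row per level in the training set.

```r
# Hypothetical helper (not from caret or sampling): stratified split in base R.
# Every level of `column` that occurs in the data gets at least one training row.
stratified_partition <- function(data, column, train.size = 0.7) {
  rows_by_level <- split(seq_len(nrow(data)), data[[column]], drop = TRUE)
  train.index <- unlist(lapply(rows_by_level, function(rows) {
    n <- max(1, round(train.size * length(rows)))
    # Index into `rows` explicitly: sample(rows, n) would misbehave when
    # `rows` has length one, because sample(5, 1) draws from 1:5.
    rows[sample.int(length(rows), n)]
  }), use.names = FALSE)
  list(train = data[train.index, , drop = FALSE],
       test = data[-train.index, , drop = FALSE])
}

set.seed(123)
df <- data.frame(a = sample(c(0, 1), 100, replace = TRUE),
                 b = factor(sample(letters, 100, replace = TRUE)))
parts <- stratified_partition(df, "b")

# Every level present in the data should also appear in the training set
all(levels(df$b) %in% as.character(parts$train$b))
```

Note that a level with a single observation ends up only in the training set, so the test set can still lack rare levels. That is usually the harmless direction, since the model has then seen every level it may be asked to predict.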
