r - Partitioning a data set in R based on multiple classes of observations

Question

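I'm trying to split a data set I have in R into 2/3 for training and 1/3 for testing. It has one classification variable and seven numerical variables. Each observation is classified as A, B, C, or D.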

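For simplicity, let's say the classification variable, cl, is A for the first 100 observations, B for observations 101 to 200, C up to 300, and D up to 400. I want a partition that contains 2/3 of the observations from each of A, B, C, and D (as opposed to simply taking 2/3 of the observations from the data set as a whole).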

When I try to sample from a subset of the data, for example with sample(subset(data, cl=='A')), the columns are reordered instead of the rows.
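The column shuffling happens because a data frame is a list of columns, so sample() permutes those list elements. A minimal sketch of sampling row positions instead, assuming a hypothetical two-class stand-in for data with a single numeric column num1:

```r
set.seed(1)  # only so the sketch is reproducible

# Hypothetical stand-in for the asker's data: 100 rows each of A and B
data <- data.frame(cl = rep(c("A", "B"), each = 100), num1 = rnorm(200))

# Row positions that belong to class A
rows_a <- which(data$cl == "A")

# Draw 67 of those positions without replacement
train_a <- sample(rows_a, size = 67)

# Index rows (before the comma), not columns
train <- data[train_a, ]
test  <- data[setdiff(rows_a, train_a), ]
```

Repeating this for each class by hand is tedious, which is the gap the answers below fill.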

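To summarize, my goal is to take 67 random observations from each of A, B, C, and D as training data, and to keep the remaining 33 observations from each of A, B, C, and D as testing data. I found a question very similar to mine, but it did not take multiple variables into account.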

score 17 · Accepted Answer

There is actually a nice package for handling machine learning problems, caret, and it contains a function, createDataPartition(), that samples 2/3 from each level of a supplied factor.

#2/3rds for training
library(caret)
inTrain = createDataPartition(df$yourFactor, p = 2/3, list = FALSE)
dfTrain=df[inTrain,]
dfTest=df[-inTrain,]

score 5 · Accepted Answer

This may be longer, but I think it's more intuitive and it can be done in base R ;)

# create the data frame you've described
x <-
    data.frame(
        cl = 
            c( 
                rep( 'A' , 100 ) ,
                rep( 'B' , 100 ) ,
                rep( 'C' , 100 ) ,
                rep( 'D' , 100 ) 
            ) ,

        othernum1 = rnorm( 400 ) ,
        othernum2 = rnorm( 400 ) ,
        othernum3 = rnorm( 400 ) ,
        othernum4 = rnorm( 400 ) ,
        othernum5 = rnorm( 400 ) ,
        othernum6 = rnorm( 400 ) ,
        othernum7 = rnorm( 400 ) 
    )

# sample 67 training rows within classification groups
training.rows <-
    tapply( 
        # numeric vector containing the numbers
        # 1 to nrow( x )
        1:nrow( x ) , 

        # break the sample function out by
        # the classification variable
        x$cl , 

        # use the sample function within
        # each classification variable group
        sample , 

        # send the size = 67 parameter
        # through to the sample() function
        size = 67 
    )

# convert your list back to a numeric vector
tr <- unlist( training.rows )

# split your original data frame into two:

# all the records sampled as training rows
training.df <- x[ tr , ]

# all other records (NOT sampled as training rows)
testing.df <- x[ -tr , ]
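To confirm that the split is stratified as intended, the per-class counts can be tabulated. This is a minimal sketch that repeats the same tapply() draw on a smaller stand-in data frame (the single numeric column num is an assumption made for brevity):

```r
set.seed(42)  # only so the sketch is reproducible

# Smaller stand-in for the data frame built above (hypothetical: one numeric column)
x <- data.frame(cl = rep(c("A", "B", "C", "D"), each = 100), num = rnorm(400))

# Same stratified draw as above: 67 row numbers per class
tr <- unlist(tapply(seq_len(nrow(x)), x$cl, sample, size = 67))

training.df <- x[tr, ]
testing.df  <- x[-tr, ]

# Per-class counts in each piece
train_counts <- table(training.df$cl)
test_counts  <- table(testing.df$cl)
print(train_counts)
print(test_counts)

# No row should appear in both pieces
overlap <- intersect(rownames(training.df), rownames(testing.df))
```

Each class should show 67 rows in the training table and 33 in the testing table, and overlap should be empty.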

score 4 · Accepted Answer

The following adds a column, set, with the value "train" or "test" to your data.frame.

library(plyr)
df <- ddply(df, "cl", transform, set = sample(c("train", "test"), length(cl),
                                              replace = TRUE, prob = c(2, 1)))

You can get something similar with the base function ave, but I find ddply fairly clean (readable) for this particular use.

You can then split your data using the subset function.

train.data <- subset(df, set == "train")
test.data  <- subset(df, set == "test")

Follow-up: to split each group into pieces of exactly 2/3 and 1/3 of its size, you can use:

df <- ddply(df, "cl", transform,
            set = sample(c(rep("train", round(2/3 * length(cl))),
                           rep("test",  round(1/3 * length(cl))))))
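For readers who want to avoid the plyr dependency, the base ave function mentioned above can produce the same kind of set column. This is a rough sketch rather than the answerer's own code; the helper assign_set and the stand-in data frame are hypothetical names introduced for illustration:

```r
set.seed(7)  # only so the sketch is reproducible

# Hypothetical stand-in for df: 100 rows per class
df <- data.frame(cl = rep(c("A", "B", "C", "D"), each = 100), num = rnorm(400))

# Within each class, shuffle a vector holding exactly round(2/3 * n) "train"
# labels and the remaining "test" labels
assign_set <- function(g) {
  n_train <- round(2/3 * length(g))
  sample(rep(c("train", "test"), times = c(n_train, length(g) - n_train)))
}

# ave() applies assign_set to each class and puts the results back in row order
df$set <- ave(as.character(df$cl), df$cl, FUN = assign_set)

train.data <- subset(df, set == "train")
test.data  <- subset(df, set == "test")
```

Taking the test count as the remainder, instead of rounding it separately, guarantees that the labels always add up to the group size.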
