r - Fully reproducible parallel models using caret

Question

When I run two random forests in caret, I get exactly the same results if I set a random seed:

library(caret)
library(doParallel)

set.seed(42)
myControl <- trainControl(method='cv', index=createFolds(iris$Species))

set.seed(42)
model1 <- train(Species~., iris, method='rf', trControl=myControl)

set.seed(42)
model2 <- train(Species~., iris, method='rf', trControl=myControl)

> all.equal(predict(model1, type='prob'), predict(model2, type='prob'))
[1] TRUE

However, if I register a parallel back-end to speed up the modeling, I get a different result each time I run the model:

cl <- makeCluster(detectCores())
registerDoParallel(cl)

set.seed(42)
myControl <- trainControl(method='cv', index=createFolds(iris$Species))

set.seed(42)
model1 <- train(Species~., iris, method='rf', trControl=myControl)

set.seed(42)
model2 <- train(Species~., iris, method='rf', trControl=myControl)

stopCluster(cl)

> all.equal(predict(model1, type='prob'), predict(model2, type='prob'))
[1] "Component 2: Mean relative difference: 0.01813729"
[2] "Component 3: Mean relative difference: 0.02271638"

Is there a way to fix this? One suggestion was to use the doRNG package, but train uses nested loops, which are not currently supported:

library(doRNG)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
registerDoRNG()

set.seed(42)
myControl <- trainControl(method='cv', index=createFolds(iris$Species))

set.seed(42)
> model1 <- train(Species~., iris, method='rf', trControl=myControl)
Error in list(e1 = list(args = seq(along = resampleIndex)(), argnames = "iter",  : 
  nested/conditional foreach loops are not supported yet.
See the package's vignette for a work around.

UPDATE: I thought this problem could be solved using doSNOW and clusterSetupRNG, but I couldn't quite get there.

set.seed(42)
library(caret)
library(doSNOW)
cl <- makeCluster(8, type = "SOCK")
registerDoSNOW(cl)

myControl <- trainControl(method='cv', index=createFolds(iris$Species))

clusterSetupRNG(cl, seed=rep(12345,6))
a <- clusterCall(cl, runif, 10000)
model1 <- train(Species~., iris, method='rf', trControl=myControl)

clusterSetupRNG(cl, seed=rep(12345,6))
b <- clusterCall(cl, runif, 10000)
model2 <- train(Species~., iris, method='rf', trControl=myControl)

all.equal(a, b)
[1] TRUE
all.equal(predict(model1, type='prob'), predict(model2, type='prob'))
[1] "Component 2: Mean relative difference: 0.01890339"
[2] "Component 3: Mean relative difference: 0.01656751"

stopCluster(cl)

What is so special about foreach, and why doesn't it use the seeds I set up on the cluster? Objects a and b are identical, so why aren't model1 and model2?

score 54 · Accepted Answer

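One easy way to run fully reproducible models in parallel mode with the caret package is to use the seeds argument when calling trainControl. The code below resolves the question above; check the trainControl help page for further details.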

library(doParallel); library(caret)

# Create a list of seeds, with a different set of seeds for each resample
set.seed(123)

# Length is (n_repeats * n_resamples) + 1
seeds <- vector(mode = "list", length = 11)

# 3 is the number of tuning parameter values (mtry for rf), here equal to ncol(iris)-2
for(i in 1:10) seeds[[i]] <- sample.int(n=1000, 3)

# Single seed for the final model
seeds[[11]] <- sample.int(1000, 1)

# Control list
myControl <- trainControl(method='cv', seeds=seeds, index=createFolds(iris$Species))

# Run the models in parallel
cl <- makeCluster(detectCores())
registerDoParallel(cl)
model1 <- train(Species~., iris, method='rf', trControl=myControl)

model2 <- train(Species~., iris, method='rf', trControl=myControl)
stopCluster(cl)

# Compare
all.equal(predict(model1, type='prob'), predict(model2, type='prob'))
[1] TRUE
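
The shape of the seeds list is the part that is easiest to get wrong, so here is a minimal sketch that wraps the construction above in a function. The helper name make_seeds and its arguments are hypothetical (they are not part of caret), and it assumes that n_tune is at least the number of tuning parameter combinations train will evaluate. It uses base R only.

```r
# Hypothetical helper, not part of caret: builds a list in the shape that
# the `seeds` argument of trainControl is described as expecting above.
# n_resamples: number of resamples (folds x repeats)
# n_tune:      number of tuning parameter combinations per resample
# Note: calling set.seed() here also resets the caller's RNG state.
make_seeds <- function(n_resamples, n_tune, master_seed = 123) {
  set.seed(master_seed)
  seeds <- vector(mode = "list", length = n_resamples + 1)
  # One integer vector per resample, one seed per candidate model
  for (i in seq_len(n_resamples)) {
    seeds[[i]] <- sample.int(n = 1000, size = n_tune)
  }
  # A single seed for the final model fit
  seeds[[n_resamples + 1]] <- sample.int(n = 1000, size = 1)
  seeds
}

seeds <- make_seeds(n_resamples = 10, n_tune = 3)
length(seeds)   # 11
lengths(seeds)  # 3 repeated 10 times, then 1
```

The result can then be passed as trainControl(method='cv', seeds=seeds, ...) exactly as in the code above. For repeated cross-validation, n_resamples would be the number of folds times the number of repeats.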

score 9 · Accepted Answer

So, caret uses the foreach package to parallelize. There is most likely a way to set the seed at each iteration, but we would need to set up more options in train.
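
To illustrate what "setting the seed at each iteration" means, here is a minimal sketch using only the base parallel package. It is an analogy under stated assumptions, not caret's or foreach's actual internals: the helper draw_with_seed and the seed values are hypothetical, and runif stands in for the random work of a model fit.

```r
library(parallel)

# Hypothetical helper: seed the RNG inside the task itself, then do the
# random work, so the result no longer depends on the worker's RNG state.
draw_with_seed <- function(seed) {
  set.seed(seed)
  runif(3)
}

cl <- makeCluster(2)

# One fixed seed per iteration, chosen up front on the master
iteration_seeds <- c(101L, 202L, 303L, 404L)

run1 <- parLapply(cl, iteration_seeds, draw_with_seed)
run2 <- parLapply(cl, iteration_seeds, draw_with_seed)

stopCluster(cl)

# The same seeds give the same draws, whichever worker ran each task
identical(run1, run2)
```

Because each task seeds itself, it does not matter which worker picks up which iteration, or what that worker had drawn before.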

Alternatively, you could create a custom modeling function that mimics the internal one for random forests and set the seed yourself.

Max

score 0 · Accepted Answer

Which version of caret were you using?

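@BBrill's answer is correct. However, since v6.0.64 (Jan 15, 2016), caret takes this problem into account. You can still provide a customized trControl$seeds, but you don't necessarily need to. If trControl$seeds is NULL, caret will generate the seeds automatically, which guarantees reproducibility even for parallel training.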

This behavior can be found in https://github.com/topepo/caret/commit/9f375a1704e413d0806b73ab8891c7fadc39081c

Pull request: https://github.com/topepo/caret/pull/353
