r - Can I run predict.glmnet on test data with a different number of predictors?

Question

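Using glmnet, I built a predictive model for a binomial regression/classification problem on a training set containing about 200 predictors and 100 samples.

I selected the best model (16 predictors), the one that gave the maximum AUC. I have an independent test set that contains only the variables that made it into the final model from the training set (the 16 predictors).

Is there a way to use predict.glmnet with the best model from the training set on a new test set that has data only for the variables that made it into the final model?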

score 3 · Accepted Answer

glmnet requires the validation/test set to contain exactly the same number and names of variables as the training dataset. For example:

library(caret)
library(glmnet)

df <- ... # a data frame with 200 variables, some of which you want to predict on
          # and some of which you don't care about.
          # Variable 13 ('Response.Variable') is the dependent variable.
          # Variables 1-12 and 14-113 are the predictor variables.
          # All training, testing and validation datasets are derived from this single df.

# Column indices of the predictors, defined once so that training and
# prediction always use the same columns in the same order
predictor.cols <- c(1:12, 14:113)

# Split the data frame into training and testing sets
inTrain <- createDataPartition(df$Response.Variable, p = .75, list = FALSE)
Train <- df[ inTrain, ] # Training dataset for all model development
Test  <- df[-inTrain, ] # Final sample for model validation

# Run cross-validated logistic regression, using only the specified predictor variables
logCV <- cv.glmnet(x = data.matrix(Train[, predictor.cols]), y = Train[, 13],
                   family = 'binomial', type.measure = 'auc')

# Test the model on the final test set, using the same predictor variables,
# and store the predicted probabilities in a new column.
# as.vector() drops the one-column matrix that predict() returns.
Test$prob <- as.vector(predict(logCV, type = 'response',
                               newx = data.matrix(Test[, predictor.cols]),
                               s = 'lambda.min'))

For a completely new dataset, you can restrict the new df to the required variables using some variation of the following:

new.df <- ... # new df with 1,000 variables, which include all predictor variables used
              # in developing the model

# Names of the predictor variables that we specified in the model, in training order
predictvars <- c('PredictorVar1', 'PredictorVar2', 'PredictorVar3',
                  ... 'PredictorVarK')

# Restrict the new df of 1,000 variables to the variables that go into the model.
# Indexing by predictvars (rather than names(new.df) %in% predictvars) keeps
# the columns in the same order as in training.
new.df$prob <- as.vector(predict(logCV, type = 'response',
                                 newx = data.matrix(new.df[, predictvars]),
                                 s = 'lambda.min'))
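The restriction step above can fail quietly if the new data is missing a predictor or has its columns in a different order. A minimal base-R sketch of a guard, assuming a hypothetical helper named select_predictors (not part of glmnet) and toy column names:

```r
# Hypothetical helper: pick the model's predictors out of a wider data frame,
# in the order used for training, and fail early if any are missing.
select_predictors <- function(newdata, predictvars) {
  missing_vars <- setdiff(predictvars, names(newdata))
  if (length(missing_vars) > 0) {
    stop("New data is missing predictors: ",
         paste(missing_vars, collapse = ", "))
  }
  data.matrix(newdata[, predictvars, drop = FALSE])
}

# Toy data: columns appear in a different order than the model expects
toy.new <- data.frame(extra = 1:3, b = c(0.5, 1.5, 2.5), a = c(10, 20, 30))
newx <- select_predictors(toy.new, c("a", "b"))
colnames(newx)
```

The resulting matrix could then be passed as newx to predict(), as in the code above.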

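In addition, glmnet only works with matrices. That is probably why you are getting the error you posted in the comments on the question. Some users (myself included) have found that as.matrix() doesn't resolve the issue; data.matrix(), however, seems to work (which is why it appears in the code above). This issue has been addressed in one or two threads on SO.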
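One likely reason for the difference, shown here on a made-up data frame: when a data frame contains a non-numeric column, as.matrix() returns a character matrix, whereas data.matrix() converts factors to their integer codes and returns a numeric matrix.

```r
# Toy data frame with one numeric and one factor column (illustrative only)
toy <- data.frame(
  age   = c(34, 51, 27),
  group = factor(c("a", "b", "a"))
)

m_as   <- as.matrix(toy)    # mixed columns are coerced to character
m_data <- data.matrix(toy)  # factors are replaced by their integer codes

typeof(m_as)
typeof(m_data)
m_data[, "group"]
```

Note that integer codes impose an ordering on the factor levels; for unordered categorical predictors, building dummy variables with model.matrix() may be a better fit.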

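I assume that all variables in the new dataset to be predicted also need to be formatted the same way as in the dataset used for model development. I usually pull all my data from the same source, so I haven't run into what glmnet does when the formatting differs.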
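For the case in the question, where the test set holds only the selected predictors, one possible workaround is to rebuild a matrix with every training column and fill the absent ones with zeros. This is a sketch under the assumption that the coefficients of the absent variables are exactly zero at the chosen lambda, so their values do not affect the prediction. The helper name pad_to_training_columns and the toy column names are hypothetical.

```r
# Hypothetical helper: build a numeric matrix with all training columns, in
# training order, copying the columns present in newdata and leaving zeros
# for the ones that are absent.
pad_to_training_columns <- function(newdata, train_cols) {
  newx <- matrix(0, nrow = nrow(newdata), ncol = length(train_cols),
                 dimnames = list(NULL, train_cols))
  shared <- intersect(train_cols, names(newdata))
  newx[, shared] <- data.matrix(newdata[, shared, drop = FALSE])
  newx
}

# Toy example: the model was trained on x1..x4, the test set only has x3 and x1
train_cols <- c("x1", "x2", "x3", "x4")
test_small <- data.frame(x3 = c(1, 2), x1 = c(5, 6))
padded <- pad_to_training_columns(test_small, train_cols)
padded
```

The padded matrix could then be passed as newx, for example predict(logCV, newx = padded, s = 'lambda.min', type = 'response'), provided train_cols matches the columns of the training matrix exactly.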
