r - p値を使用して有意でないp値を持つ変数を削除するステップワイズ回帰

Question

選択基準としてp値を使用して、段階的な線形回帰を実行したいと思います。たとえば、各ステップで、最も高い、つまり最も重要でないp値を持つ変数を削除し、すべての値が何らかのしきい値alphaによって有意に定義されたときに停止します。

代わりにAIC（コマンドstepやstepAICなど）またはその他の基準を使用する必要があることを完全に認識していますが、上司は統計を把握しておらず、p値の使用を主張しています。

必要に応じて、自分のルーチンをプログラムすることもできますが、これの実装済みバージョンがあるかどうか疑問に思っています。

score 31 · Accepted Answer

上司に次のものを見せてください。

set.seed(100)
x1 <- runif(100,0,1)
x2 <- as.factor(sample(letters[1:3],100,replace=T))

y <- x1+x1*(x2=="a")+2*(x2=="b")+rnorm(100)
summary(lm(y~x1*x2))

与える：

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.1525     0.3066  -0.498  0.61995    
x1            1.8693     0.6045   3.092  0.00261 ** 
x2b           2.5149     0.4334   5.802 8.77e-08 ***
x2c           0.3089     0.4475   0.690  0.49180    
x1:x2b       -1.1239     0.8022  -1.401  0.16451    
x1:x2c       -1.0497     0.7873  -1.333  0.18566    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

さて、p値に基づいて、どれを除外しますか? x2 は、最も重要であると同時に最も重要ではありません。

編集：明確にするために：コメントに示されているように、この例は最善ではありません。Stata と SPSS の手順も、係数の T 検定の p 値ではなく、変数の 1 つを削除した後の F 検定に基づいていることがわかります。

まさにそれを行う機能があります。これは「p 値」の選択ですが、係数または anova 結果の T 検定ではありません。まあ、それがあなたにとって有用であると思われる場合は、自由に使用してください.

#####################################
# Automated model selection
# Author      : Joris Meys
# version     : 0.2
# date        : 12/01/09
#####################################
#CHANGE LOG
# 0.2   : check for empty scopevar vector
#####################################

# Function has.interaction checks whether x is part of a term in terms
# terms is a vector with names of terms from a model
has.interaction <- function(x,terms){
    out <- sapply(terms,function(i){
        sum(1-(strsplit(x,":")[[1]] %in% strsplit(i,":")[[1]]))==0
    })
    return(sum(out)>0)
}

# Function Model.select
# model is the lm object of the full model
# keep is a list of model terms to keep in the model at all times
# sig gives the significance for removal of a variable. Can be 0.1 too (see SPSS)
# verbose=T gives the F-tests, dropped var and resulting model after 
model.select <- function(model,keep,sig=0.05,verbose=F){
      counter=1
      # check input
      if(!is(model,"lm")) stop(paste(deparse(substitute(model)),"is not an lm object\n"))
      # calculate scope for drop1 function
      terms <- attr(model$terms,"term.labels")
      if(missing(keep)){ # set scopevars to all terms
          scopevars <- terms
      } else{            # select the scopevars if keep is used
          index <- match(keep,terms)
          # check if all is specified correctly
          if(sum(is.na(index))>0){
              novar <- keep[is.na(index)]
              warning(paste(
                  c(novar,"cannot be found in the model",
                  "\nThese terms are ignored in the model selection."),
                  collapse=" "))
              index <- as.vector(na.omit(index))
          }
          scopevars <- terms[-index]
      }

      # Backward model selection : 

      while(T){
          # extract the test statistics from drop.
          test <- drop1(model, scope=scopevars,test="F")

          if(verbose){
              cat("-------------STEP ",counter,"-------------\n",
              "The drop statistics : \n")
              print(test)
          }

          pval <- test[,dim(test)[2]]

          names(pval) <- rownames(test)
          pval <- sort(pval,decreasing=T)

          if(sum(is.na(pval))>0) stop(paste("Model",
              deparse(substitute(model)),"is invalid. Check if all coefficients are estimated."))

          # check if all significant
          if(pval[1]<sig) break # stops the loop if all remaining vars are sign.

          # select var to drop
          i=1
          while(T){
              dropvar <- names(pval)[i]
              check.terms <- terms[-match(dropvar,terms)]
              x <- has.interaction(dropvar,check.terms)
              if(x){i=i+1;next} else {break}              
          } # end while(T) drop var

          if(pval[i]<sig) break # stops the loop if var to remove is significant

          if(verbose){
             cat("\n--------\nTerm dropped in step",counter,":",dropvar,"\n--------\n\n")              
          }

          #update terms, scopevars and model
          scopevars <- scopevars[-match(dropvar,scopevars)]
          terms <- terms[-match(dropvar,terms)]

          formul <- as.formula(paste(".~.-",dropvar))
          model <- update(model,formul)

          if(length(scopevars)==0) {
              warning("All variables are thrown out of the model.\n",
              "No model could be specified.")
              return()
          }
          counter=counter+1
      } # end while(T) main loop
      return(model)
}

score 19 · Accepted Answer

step()テスト方法を指定する関数を使用してみませんか?

たとえば、後方消去の場合は、次のコマンドのみを入力します。

step(FullModel, direction = "backward", test = "F")

段階的な選択の場合は、次のようにします。

step(FullModel, direction = "both", test = "F")

これにより、AIC 値と F 値および P 値の両方を表示できます。

score 10 · Accepted Answer

これが例です。最も複雑なモデルから始めます。これには、3つの説明変数すべての間の相互作用が含まれます。

model1 <-lm (ozone~temp*wind*rad)
summary(model1)

Coefficients:
Estimate Std.Error t value Pr(>t)
(Intercept) 5.683e+02 2.073e+02 2.741 0.00725 **
temp          -1.076e+01 4.303e+00 -2.501 0.01401 *
wind          -3.237e+01 1.173e+01 -2.760 0.00687 **
rad           -3.117e-01 5.585e-01 -0.558 0.57799
temp:wind      2.377e-01 1.367e-01 1.739 0.08519   
temp:rad       8.402e-03 7.512e-03 1.119 0.26602
wind:rad       2.054e-02 4.892e-02 0.420 0.47552
temp:wind:rad -4.324e-04 6.595e-04 -0.656 0.51358

三者間相互作用は明らかに重要ではありません。これは、モデルの単純化のプロセスを開始するために、それを削除する方法です。

model2 <- update(model1,~. - temp:wind:rad)
summary(model2)

結果に応じて、モデルの簡略化を続けることができます。

model3 <- update(model2,~. - temp:rad)
summary(model3)
...

または、自動モデル簡略化関数を使用して、stepそれがどの程度うまく機能するかを確認することもできます。

model_step <- step(model1)

score 8 · Accepted Answer

パッケージrms: Regression Modeling Strategiesにはfastbw()、まさに必要なことを行う機能があります。AIC から p 値ベースの除去に切り替えるパラメーターもあります。

score 7 · Accepted Answer

最良の予測モデルを取得しようとしているだけであれば、おそらくそれほど重要ではありませんが、それ以外の場合は、この種のモデルの選択を気にする必要はありません。違います。

lm.ridge()リッジ回帰 (たとえば、MASS パッケージ内)、投げ縄、elasticnet (リッジと投げ縄の制約の組み合わせ)などの縮小方法を使用します。これらのうち、LASSO と Elastic Net だけが何らかの形のモデル選択を行います。つまり、一部の共変量の係数を強制的にゼロにします。

CRANのMachine Learningタスクビューの正則化と縮小セクションを参照してください。

r - p値を使用して有意でないp値を持つ変数を削除するステップワイズ回帰

6 に答える 6

Related