r - Fitting several regression models with dplyr

Question

I would like to fit a model for each hour (the factor variable) using dplyr. I'm getting an error, and I'm not quite sure what's wrong.

df.h <- data.frame( 
  hour     = factor(rep(1:24, each = 21)),
  price    = runif(504, min = -10, max = 125),
  wind     = runif(504, min = 0, max = 2500),
  temp     = runif(504, min = - 10, max = 25)  
)

df.h <- tbl_df(df.h)
df.h <- group_by(df.h, hour)

group_size(df.h) # checks out, 21 obs. for each factor variable

# different attempts:
reg.models <- do(df.h, formula = price ~ wind + temp)

reg.models <- do(df.h, .f = lm(price ~ wind + temp, data = df.h))

I've tried a bunch of different variations, but I can't get it to work.

score 19 · Accepted Answer

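As of mid-2020, tchakravarty's answer fails. To work around the new way broom and dplyr seem to interact, you can use the following combinations of broom::tidy, broom::augment and broom::glance. You just have to use them inside do() and then unnest() the resulting tibble.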

library(dplyr)
library(broom)
library(tidyr)

df.h = data.frame( 
  hour     = factor(rep(1:24, each = 21)),
  price    = runif(504, min = -10, max = 125),
  wind     = runif(504, min = 0, max = 2500),
  temp     = runif(504, min = - 10, max = 25)  
)

df.h %>% group_by(hour) %>%
  do(fitHour = tidy(lm(price ~ wind + temp, data = .))) %>% 
  unnest(fitHour)
# # A tibble: 72 x 6
#    hour  term        estimate std.error statistic   p.value
#    <fct> <chr>          <dbl>     <dbl>     <dbl>     <dbl>
#  1 1     (Intercept)   82.4     18.1         4.55  0.000248 
#  2 1     wind         -0.0212   0.0108      -1.96  0.0655   
#  3 1     temp         -1.01     0.792       -1.28  0.218    
#  4 2     (Intercept)   25.9     19.7         1.31  0.206    
#  5 2     wind          0.0204   0.0131       1.57  0.135    
#  6 2     temp          0.680    1.01         0.670 0.511    
#  7 3     (Intercept)   88.3     15.5         5.69  0.0000214
#  8 3     wind         -0.0188   0.00998     -1.89  0.0754   
#  9 3     temp         -0.669    0.653       -1.02  0.319    
# 10 4     (Intercept)   73.4     14.2         5.17  0.0000639

df.h %>% group_by(hour) %>%
  do(fitHour = glance(lm(price ~ wind + temp, data = .))) %>% 
  unnest(fitHour)
# # A tibble: 24 x 13
#    hour  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC deviance
#    <fct>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>
#  1 1        0.246        0.162    39.0     2.93   0.0790     2  -105.  218.  222.   27334.
#  2 2        0.161        0.0674   43.5     1.72   0.207      2  -107.  223.  227.   34029.
#  3 3        0.192        0.102    33.9     2.14   0.147      2  -102.  212.  217.   20739.
#  4 4        0.0960      -0.00445  34.3     0.956  0.403      2  -102.  213.  217.   21169.
#  5 5        0.230        0.144    31.7     2.68   0.0955     2  -101.  210.  214.   18088.
#  6 6        0.0190      -0.0900   39.8     0.174  0.842      2  -106.  219.  223.   28507.
#  7 7        0.0129      -0.0967   37.1     0.118  0.889      2  -104.  216.  220.   24801.
#  8 8        0.197        0.108    35.3     2.21   0.139      2  -103.  214.  218.   22438.
#  9 9        0.0429      -0.0634   39.4     0.403  0.674      2  -105.  219.  223.   27918.
# 10 10       0.0943      -0.00633  35.6     0.937  0.410      2  -103.  214.  219.   22854.
# # … with 14 more rows, and 2 more variables: df.residual <int>, nobs <int>

df.h %>% group_by(hour) %>%
  do(fitHour = augment(lm(price ~ wind + temp, data = .))) %>% 
  unnest(fitHour)
# # A tibble: 504 x 10
#    hour   price  wind   temp .fitted .resid .std.resid   .hat .sigma  .cooksd
#    <fct>  <dbl> <dbl>  <dbl>   <dbl>  <dbl>      <dbl>  <dbl>  <dbl>    <dbl>
#  1 1      94.2   883. -6.64     70.4  23.7       0.652 0.129    39.6 0.0209  
#  2 1      19.3  2107.  2.40     35.4 -16.0      -0.431 0.0864   39.9 0.00584 
#  3 1      60.5  2161. 18.3      18.1  42.5       1.18  0.146    38.5 0.0795  
#  4 1     116.   1244. 12.0      44.0  71.9       1.91  0.0690   35.8 0.0902  
#  5 1     117.   1624. -8.05     56.1  60.6       1.67  0.128    36.9 0.136   
#  6 1      75.0   220. -0.838    78.6  -3.58     -0.101 0.175    40.1 0.000724
#  7 1     106.    765.  6.15     60.0  45.7       1.22  0.0845   38.4 0.0461  
#  8 1      -9.89 2055. 12.3      26.5 -36.4      -0.979 0.0909   39.0 0.0319  
#  9 1      96.1   215. -8.36     86.3   9.82      0.287 0.232    40.0 0.00830 
# 10 1      27.2   323. 22.4      52.9 -25.7      -0.777 0.278    39.4 0.0774  
# # … with 494 more rows

Credit to Bob Muenchen's blog for the inspiration.
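A related route that skips do() and unnest() is group_modify(), which exists in dplyr 0.8.1 and later. This is only a sketch under that assumption, not part of the original answer; the seed and the name coefs are arbitrary choices made for the example.

```r
library(dplyr)
library(broom)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
df.h <- data.frame(
  hour  = factor(rep(1:24, each = 21)),
  price = runif(504, min = -10, max = 125),
  wind  = runif(504, min = 0, max = 2500),
  temp  = runif(504, min = -10, max = 25)
)

# group_modify() calls the function once per group. `.x` holds that group's
# rows without the grouping column, and `hour` is added back to the result.
coefs <- df.h %>%
  group_by(hour) %>%
  group_modify(~ tidy(lm(price ~ wind + temp, data = .x))) %>%
  ungroup()

coefs
```

Swapping tidy() for glance() or augment() should give the per-model and per-observation summaries in the same way.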

score 10 · Accepted Answer

From the documentation of do:

.f: a function to apply to each piece. The first unnamed argument supplied to .f will be a data frame.

So:

reg.models <- do(df.h, 
                 .f=function(data){
                     lm(price ~ wind + temp, data=data)
                 })

It is probably also useful to save which hour each model was fitted to:

reg.models <- do(df.h, 
                 .f=function(data){
                     m <- lm(price ~ wind + temp, data=data)
                     m$hour <- unique(data$hour)
                     m
                 })
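In later dplyr versions do() no longer takes a .f argument, as far as I can tell. A named expression with `.` as the placeholder for the group's data does the same job, and the hour is kept next to each model automatically. The sketch below assumes dplyr 1.0 or later, where do() is marked as superseded but still works; the name fit5 is just for illustration.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
df.h <- data.frame(
  hour  = factor(rep(1:24, each = 21)),
  price = runif(504, min = -10, max = 125),
  wind  = runif(504, min = 0, max = 2500),
  temp  = runif(504, min = -10, max = 25)
)

# A named argument to do() stores one fitted model per group in a
# list-column called `model`, alongside the `hour` it belongs to.
reg.models <- df.h %>%
  group_by(hour) %>%
  do(model = lm(price ~ wind + temp, data = .))

# Look up the fit for hour 5 and inspect its coefficients.
fit5 <- reg.models$model[[which(reg.models$hour == "5")]]
coef(fit5)
```

Because the hour sits in its own column, there is no need to attach it to the lm object by hand.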

score 8 · Accepted Answer

I think you can use dplyr in a more proper way, without needing to define a function as in @fabians answer.

results<-df.h %.% 
group_by(hour) %.% 
do(failwith(NULL, lm), formula = price ~ wind + temp)

or

results<-do(group_by(tbl_df(df.h), hour),
failwith(NULL, lm), formula = price ~ wind + temp)

Edit: Of course it also works without failwith

results<-df.h %.% 
    group_by(hour) %.% 
    do(lm, formula = price ~ wind + temp)


results<-do(group_by(tbl_df(df.h), hour),
lm, formula = price ~ wind + temp)
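As far as I know, %.% and failwith are no longer available in current dplyr. A rough present-day equivalent of the failwith(NULL, lm) idea is purrr::possibly(), sketched below. The helper name safe_lm and the use of split() are my own choices for the example, not something from the answer above.

```r
library(purrr)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
df.h <- data.frame(
  hour  = factor(rep(1:24, each = 21)),
  price = runif(504, min = -10, max = 125),
  wind  = runif(504, min = 0, max = 2500),
  temp  = runif(504, min = -10, max = 25)
)

# possibly() wraps a function so that an error returns `otherwise`
# instead of stopping the whole run.
safe_lm <- possibly(
  function(d) lm(price ~ wind + temp, data = d),
  otherwise = NULL
)

# split() gives one data frame per hour, named by the factor level.
results <- map(split(df.h, df.h$hour), safe_lm)

# Drop any groups where the fit failed (none are expected with this data).
results <- compact(results)
```

The result is a plain named list of lm objects, so results[["5"]] is the model for hour 5.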

score 4 · Accepted Answer

As of dplyr 1.0.0, group_split provides a convenient shortcut for this action.

library(dplyr)
library(broom)
library(purrr)
df.h <- data.frame( 
  hour     = factor(rep(1:24, each = 21)),
  price    = runif(504, min = -10, max = 125),
  wind     = runif(504, min = 0, max = 2500),
  temp     = runif(504, min = - 10, max = 25)  
)

df.g <- group_split(df.h, hour)
map_dfr(df.g, function(x) tidy(lm(price ~ wind + temp, data=x)))
#> # A tibble: 72 x 5
#>    term        estimate std.error statistic p.value
#>    <chr>          <dbl>     <dbl>     <dbl>   <dbl>
#>  1 (Intercept) -10.4      20.3       -0.512 0.615  
#>  2 wind          0.0377    0.0117     3.23  0.00467
#>  3 temp          1.34      0.890      1.50  0.150  
#>  4 (Intercept)  34.6      18.6        1.86  0.0799 
#>  5 wind          0.0214    0.0125     1.71  0.104  
#>  6 temp          0.332     0.865      0.384 0.706  
#>  7 (Intercept)  42.5      15.3        2.79  0.0122 
#>  8 wind          0.0103    0.0116     0.888 0.386  
#>  9 temp         -0.542     0.736     -0.736 0.471  
#> 10 (Intercept)  64.1      18.8        3.41  0.00312
#> # … with 62 more rows

Created on 2021-03-04 by the reprex package (v1.0.0)
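One thing to watch in the output above: the hour label is lost, because group_split() returns an unnamed list. A possible variation, sketched here with base split() and dplyr::bind_rows(), keeps it. The names pieces and coefs are only illustrative.

```r
library(dplyr)
library(broom)
library(purrr)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
df.h <- data.frame(
  hour  = factor(rep(1:24, each = 21)),
  price = runif(504, min = -10, max = 125),
  wind  = runif(504, min = 0, max = 2500),
  temp  = runif(504, min = -10, max = 25)
)

# split() names each piece by its hour level, unlike group_split().
pieces <- split(df.h, df.h$hour)

coefs <- pieces %>%
  map(function(x) tidy(lm(price ~ wind + temp, data = x))) %>%
  bind_rows(.id = "hour")  # the list names become the `hour` column

coefs
```

Note that hour comes back as a character column here, so convert it back to a factor if the level order matters.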
