r - パネルデータの二重クラスター標準誤差

Question

R にパネルデータセット (時間と断面積) があり、2 つの次元でクラスター化された標準誤差を計算したいと考えています。これは、残差が双方向で相関しているためです。グーグルで検索すると、これを行う機能を提供するhttp://thetarzan.wordpress.com/2011/06/11/clustered-standard-errors-in-r/が見つかりました。少しアドホックに思えるので、テスト済みでこれを行うパッケージがあるかどうか知りたいですか?

HAC 標準誤差は知っsandwichていますが、二重クラスタリング (つまり、2 次元に沿ったもの) は行いません。

score 10 · Accepted Answer

これは古い質問です。しかし、人々がまだそれに着地しているように見えるので、R でのマルチウェイクラスタリングへの最新のアプローチをいくつか提供したいと思いました。

オプション 1 (最速):`fixest::feols()`

library(fixest)

nlswork = haven::read_dta("http://www.stata-press.com/data/r14/nlswork.dta")

est_feols = feols(ln_wage ~ age | race + year, cluster = ~race+year, data = nlswork)
est_feols

## An important feature of fixest: We can _instantaneously_ compute other
## VCOV matrices / SEs on the fly with summary.fixest(). No need to re-run
## the model!
summary(est_feols, se = 'standard') ## vanilla SEs
summary(est_feols, se = 'hetero') ## robust SEs
summary(est_feols, se = 'twoway') ## diff syntax, but same as original model
summary(est_feols, cluster = c('race', 'year')) ## ditto
summary(est_feols, cluster = ~race^year) ## interacted cluster vars
summary(est_feols, cluster = ~ race + year + idcode)  ## add third cluster var (not in original model call)
## etc.

オプション 2 (高速):`lfe::felm()`

library(lfe)

## Note the third "| 0 " slot means we're not using IV

est_felm = felm(ln_wage ~ age | race + year | 0 | race + year, data = nlswork)
summary(est_felm)

オプション 3 (遅いが柔軟):`sandwich`

library(sandwich)
library(lmtest)

est_sandwich = lm(ln_wage ~ age + factor(race) + factor(year), data = nlswork) 
coeftest(est_sandwich, vcov = vcovCL, cluster = ~ race + year)

基準

ああ、そして、速度についてのポイントを説明するために。これは、3 つの異なるアプローチのベンチマークです (2 つの固定 FE と双方向クラスタリングを使用)。

est_feols = function() feols(ln_wage ~ age | race + year, cluster = ~race+year, data = nlswork) 
est_felm = function() felm(ln_wage ~ age | race + year | 0 | race + year, data = nlswork)
est_standwich = function() {coeftest(lm(ln_wage ~ age + factor(race) + factor(year), data = nlswork), 
                                     vcov = vcovCL, cluster = ~ race + year)}

microbenchmark(est_feols(), est_felm(), est_standwich(), times = 3)

#> Unit: milliseconds
#>             expr       min        lq      mean    median        uq       max neval cld
#>      est_feols()  11.94122  11.96158  12.55835  11.98193  12.86692  13.75191     3 a  
#>       est_felm()  87.18064  95.89905 100.69589 104.61746 107.45352 110.28957     3  b 
#>  est_standwich() 176.43502 183.95964 188.50271 191.48425 194.53656 197.58886     3   c

score 7 · Accepted Answer

パネル回帰の場合、plmパッケージは 2 つの次元に沿ってクラスター化された SE を推定できます。

M. Petersen のベンチマーク結果を使用:

require(foreign)
require(plm)
require(lmtest)
test <- read.dta("http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/test_data.dta")

##Double-clustering formula (Thompson, 2011)
vcovDC <- function(x, ...){
    vcovHC(x, cluster="group", ...) + vcovHC(x, cluster="time", ...) - 
        vcovHC(x, method="white1", ...)
}

fpm <- plm(y ~ x, test, model='pooling', index=c('firmid', 'year'))

これで、クラスター化された SE を取得できるようになります。

##Clustered by *group*
> coeftest(fpm, vcov=function(x) vcovHC(x, cluster="group", type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.066952  0.4433   0.6576    
x           1.034833   0.050550 20.4714   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

##Clustered by *time*
> coeftest(fpm, vcov=function(x) vcovHC(x, cluster="time", type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.022189  1.3376   0.1811    
x           1.034833   0.031679 32.6666   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

##Clustered by *group* and *time*
> coeftest(fpm, vcov=function(x) vcovDC(x, type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.064580  0.4596   0.6458    
x           1.034833   0.052465 19.7243   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

詳細については、次を参照してください。

Fama-MacBeth および Cluster-Robust (企業および時間による) R の標準誤差。

ただし、上記は、データを強制的にpdata.frame. あると失敗します"duplicate couples (time-id)"。この場合でもクラスター化できますが、1 つの次元に沿ってのみです。

インデックスを1 つplmだけ指定することで、適切なパネルデータセットがあると思い込ませます。

fpm.tr <- plm(y ~ x, test, model='pooling', index=c('firmid'))

これで、クラスター化された SE を取得できるようになります。

##Clustered by *group*
> coeftest(fpm.tr, vcov=function(x) vcovHC(x, cluster="group", type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.066952  0.4433   0.6576    
x           1.034833   0.050550 20.4714   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

この回避策を使用して、より高い次元またはより高いレベル(industryまたはなどcountry) でクラスター化することもできます。ただし、その場合、アプローチの主な制限であるgroup(or time)を使用できなくなります。effects

パネルと他のタイプのデータの両方で機能するもう 1 つのアプローチは、multiwayvcovパッケージです。二重クラスタリングだけでなく、高次元でのクラスタリングも可能です。パッケージのWeb サイトによると、これは Arai のコードを改良したものです。

欠落のために削除された観測の透過的な処理

完全な多方向 (または n 方向、n 次元、または多次元) クラスタリング

Petersen データとを使用すると、次のようになりcluster.vcov()ます。

library("lmtest")
library("multiwayvcov")

data(petersen)
m1 <- lm(y ~ x, data = petersen)

coeftest(m1, vcov=function(x) cluster.vcov(x, petersen[ , c("firmid", "year")]))
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.029680   0.065066  0.4561   0.6483    
## x           1.034833   0.053561 19.3206   <2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

score 5 · Accepted Answer

新井の関数は、標準誤差のクラスタリングに使用できます。彼は多次元でクラスタリングするための別のバージョンを持っています:

mcl <- function(dat,fm, cluster1, cluster2){
          attach(dat, warn.conflicts = F)
          library(sandwich);library(lmtest)
          cluster12 = paste(cluster1,cluster2, sep="")
          M1  <- length(unique(cluster1))
          M2  <- length(unique(cluster2))   
          M12 <- length(unique(cluster12))
          N   <- length(cluster1)          
          K   <- fm$rank             
          dfc1  <- (M1/(M1-1))*((N-1)/(N-K))  
          dfc2  <- (M2/(M2-1))*((N-1)/(N-K))  
          dfc12 <- (M12/(M12-1))*((N-1)/(N-K))  
          u1j   <- apply(estfun(fm), 2, function(x) tapply(x, cluster1,  sum)) 
          u2j   <- apply(estfun(fm), 2, function(x) tapply(x, cluster2,  sum)) 
          u12j  <- apply(estfun(fm), 2, function(x) tapply(x, cluster12, sum)) 
          vc1   <-  dfc1*sandwich(fm, meat=crossprod(u1j)/N )
          vc2   <-  dfc2*sandwich(fm, meat=crossprod(u2j)/N )
          vc12  <- dfc12*sandwich(fm, meat=crossprod(u12j)/N)
          vcovMCL <- vc1 + vc2 - vc12
          coeftest(fm, vcovMCL)}

参照と使用例については、次を参照してください。

R のクラスタ化された標準誤差

score 4 · Accepted Answer

Frank Harrell のパッケージrms(以前はという名前でしDesignた) には、クラスタリングの際によく使用する関数がありますrobcov。

?robcovたとえば、のこの部分を参照してください。

cluster: a variable indicating groupings. ‘cluster’ may be any type of
      vector (factor, character, integer).  NAs are not allowed.
      Unique values of ‘cluster’ indicate possibly correlated
      groupings of observations. Note the data used in the fit and
      stored in ‘fit$x’ and ‘fit$y’ may have had observations
      containing missing values deleted. It is assumed that if any
      NAs were removed during the original model fitting, an
      ‘naresid’ function exists to restore NAs so that the rows of
      the score matrix coincide with ‘cluster’. If ‘cluster’ is
      omitted, it defaults to the integers 1,2,...,n to obtain the
      "sandwich" robust covariance matrix estimate.

r - パネル データの二重クラスター標準誤差

4 に答える 4

オプション 1 (最速):fixest::feols()

オプション 2 (高速):lfe::felm()

オプション 3 (遅いが柔軟):sandwich

基準

Related

Reference

r - パネルデータの二重クラスター標準誤差

オプション 1 (最速):`fixest::feols()`

オプション 2 (高速):`lfe::felm()`

オプション 3 (遅いが柔軟):`sandwich`