50  Walk-through of calculating robust standard errors

Author

Luke Miratrix

In this document, we’ll discuss approaches to dealing with clustered data which focus on getting the standard errors for the coefficients right, without bothering with modeling the second level. We’ll start by discussing an approach for correcting for heteroscedasticity (unequal variance in the residuals at different levels of the predictors), and then show how to use a similar technique to correct for residuals which may be correlated within clusters.

The goal is to show you how to use cluster-robust standard errors to correct for biased standard errors introduced by working with clustered data. We’ll also show you how you can implement some model-fitting techniques using the matrix operations in R.

We’ll be working with data we’ve seen before (the High School and Beyond dataset).

While this document shows how to calculate things by hand, it also shows the relevant R packages to automate it so you don’t have to bother. The “by-hand” stuff is for interest, and to see what is happening under the hood.

50.1 Robust errors (no clustering)

The (no clustering, ordinary) linear regression model assumes that

\[y = X\beta + \varepsilon\]

with the \(\varepsilon\)’s independently and identically normally distributed with variance \(\sigma^2\). Here \(\beta\) is a column vector of regression coefficients, \((\beta_0, \beta_1, \beta_2)\) in our example. \(y\) is a vector of the outcomes and \(\varepsilon\) is a vector of the residuals. \(X\) is an \(n\) by \(p\) matrix referred to as the model matrix (\(p\) is the number of predictors, including the intercept). In this example, the first column of the matrix is all 1’s, for the intercept, the second column is each person’s value for ses, and the third is each person’s value for sector (which will be the same for all students in a single school).

library( foreign ) ### read.spss() comes from the foreign package
dat = read.spss( "data/hsb1.sav", to.data.frame=TRUE )
sdat = read.spss( "data/hsb2.sav", to.data.frame=TRUE )
dat = merge( dat, sdat, by="id", all.x=TRUE )
dat = dat[ c( "id", "mathach", "ses", "sector" ) ]
dat$id <- factor( dat$id ) ### make the school variable a factor
head( dat )
    id mathach    ses sector
1 1224   5.876 -1.528      0
2 1224  19.708 -0.588      0
3 1224  20.349 -0.528      0
4 1224   8.781 -0.668      0
5 1224  17.898 -0.158      0
6 1224   4.583  0.022      0

We can make the model matrix from a regression formula:

X <- model.matrix( mathach ~ ses + sector, data = dat )
head( X )
  (Intercept)    ses sector
1           1 -1.528      0
2           1 -0.588      0
3           1 -0.528      0
4           1 -0.668      0
5           1 -0.158      0
6           1  0.022      0
y <- dat$mathach
head( y )
[1]  5.876 19.708 20.349  8.781 17.898  4.583

With these assumptions, our estimate for \(\beta\) using the OLS criterion is \(\hat{\beta} = (X^TX)^{-1}X^Ty\). We can calculate this directly with R.

solve(t(X) %*% X) %*% t(X) %*% y ##(X'X)^{-1}X'y
                 [,1]
(Intercept) 11.793254
ses          2.948558
sector       1.935013

Compare with lm: they are the same!

mod = lm(mathach ~ ses + sector, data = dat)
mod 

Call:
lm(formula = mathach ~ ses + sector, data = dat)

Coefficients:
(Intercept)          ses       sector  
     11.793        2.949        1.935  

We can also estimate standard errors for the coefficients by taking \(\sqrt{\hat{\sigma}^2diag((X^TX)^{-1})}\).

beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
preds <- X %*% beta_hat
resids <- y - preds
sigma_2_hat <- sum(resids^2)/(nrow(X)-3) ### estimate of residual variance (divide by n - K, with K = 3)
sqrt(sigma_2_hat * diag(solve(t(X) %*% X))) ### using the matrix algebra
(Intercept)         ses      sector 
 0.10610213  0.09783058  0.15249341 

Again, compare:

library( arm )
display( mod ) ### same results
lm(formula = mathach ~ ses + sector, data = dat)
            coef.est coef.se
(Intercept) 11.79     0.11  
ses          2.95     0.10  
sector       1.94     0.15  
---
n = 7185, k = 3
residual sd = 6.35, R-Squared = 0.15

But notice that this assumes that the residuals have a single variance, \(\sigma^2\). Frequently this assumption is implausible, in which case the standard errors we derive may not be correct. It would be useful to have a way to derive standard errors which does not require us to assume that the residuals are homoscedastic. This is where heteroscedasticity-robust standard errors, or Huber-White standard errors, come in. Huber-White standard errors are asymptotically correct, even if the residual variance is not constant at all values of the predictor.

The basic idea behind Huber-White standard errors is that we let each individual squared residual serve as an estimate of the variance of the residuals at that value of the predictors. If we let \(V = (X^TX)^{-1},\) \(N\) be the number of observations, and \(K\) be the number of predictors, including the intercept, then the formula for the standard errors is

\[ SE^2 = \frac{N}{N-K} \cdot diag\left( V \cdot \left( \sum X_i X_i^T \varepsilon_i^2 \right) \cdot V\right) \]

This is called a sandwich estimator, where \(V\) is the bread and \(\sum X_i X_i^T \varepsilon_i^2\) (which is a \(K\) by \(K\) matrix) is the meat. Below, we implement this in R.

N <- nrow(dat) ### number of observations
K <- 3 ### number of regression coefficients, including the intercept
V <- solve(t(X) %*% X) ### the bread
V
              (Intercept)           ses        sector
(Intercept)  2.796108e-04  3.460078e-05 -0.0002847979
ses          3.460078e-05  2.377141e-04 -0.0000702375
sector      -2.847979e-04 -7.023750e-05  0.0005775742
meat <- matrix(0, nrow = K, ncol = K) ### we'll build the meat as we go, iterating over the 
                                      ### individual rows
for(i in 1:nrow(dat)){
  this_point <- X[i, ] %*% t(X[i, ]) * resids[i]^2 ### the contribution of this particular 
                                                   ### point
  meat <- meat + this_point ### take the current meat, and add this point's contribution
}
meat
     (Intercept)        ses     sector
[1,]  289161.019  -3048.176 133136.299
[2,]   -3048.176 159558.729   9732.201
[3,]  133136.299   9732.201 133136.299
SEs = sqrt(diag(N/(N-K) * V %*% meat %*% V)) ### standard errors
SEs
(Intercept)         ses      sector 
 0.11021454  0.09487279  0.15476724 

Notice that the estimated standard errors haven’t changed much, so whatever heteroscedasticity is present in this association doesn’t seem to be affecting them.
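As a quick check, we can compare the robust standard errors to the classical ones from before (reusing the sigma_2_hat and V objects computed above); the ratios are all within about 4% of 1.

SEs / sqrt( sigma_2_hat * diag(V) ) ### ratio of robust to classical standard errors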

Combining the above steps in a tidy bit of code gives:

mod <- lm(mathach ~ ses + sector, data = dat)
resids = resid( mod )

X <- model.matrix(mathach ~ ses + sector, data = dat)

V <- solve(t(X) %*% X) ### the bread
vcov_hw = V %*% t(X) %*% diag(resids^2) %*% X %*% V ### the "HC0" sandwich (no N/(N-K) correction)

vcov_hw
             (Intercept)          ses       sector
(Intercept)  0.012142174  0.001957716 -0.012535538
ses          0.001957716  0.008997088 -0.003992666
sector      -0.012535538 -0.003992666  0.023942897
sqrt(diag(vcov_hw)) ### standard errors
(Intercept)         ses      sector 
 0.11019153  0.09485298  0.15473493 
sqrt( diag( vcov( mod ) ) )
(Intercept)         ses      sector 
 0.10610213  0.09783058  0.15249341 

50.1.1 R Packages to do all this for you

There is an R package to do all of this for us. The following gives us the “Variance Covariance” matrix:

library(sandwich)
vc <- vcovHC( mod, type = "HC0")
print( vc, digits=3 )
            (Intercept)      ses   sector
(Intercept)     0.01214  0.00196 -0.01254
ses             0.00196  0.00900 -0.00399
sector         -0.01254 -0.00399  0.02394

The square roots of the diagonal entries are our standard errors:

sqrt( diag( vc ) )
(Intercept)         ses      sector 
 0.11019153  0.09485298  0.15473493 

They match the tidy hand calculation (vcov_hw) exactly; the tiny differences from our first hand calculation (SEs) come from the \(N/(N-K)\) finite-sample correction, which the “HC0” estimator omits. Observe how the differences are all very close to zero:

sqrt( diag( vc ) ) - SEs
  (Intercept)           ses        sector 
-2.301170e-05 -1.980850e-05 -3.231386e-05 

We can use them for testing as follows:

library( lmtest )
coeftest( mod, vcov. = vc )

t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept) 11.793254   0.110192 107.025 < 2.2e-16 ***
ses          2.948558   0.094853  31.086 < 2.2e-16 ***
sector       1.935013   0.154735  12.505 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Note the trailing “.” in the argument name: coeftest() really does call this argument vcov., presumably to avoid clashing with the vcov() function itself.)

In fact, these packages play well together, so you can tell lmtest to use the vcovHC function itself as follows. (Called this way, vcovHC uses its default type, “HC3”, which is why the numbers differ very slightly from the “HC0” results above.)

coeftest( mod, vcov. = vcovHC )

t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept) 11.793254   0.110237 106.981 < 2.2e-16 ***
ses          2.948558   0.094913  31.066 < 2.2e-16 ***
sector       1.935013   0.154801  12.500 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All this is well and good, but everything we have done so far is WRONG because we have failed to account for the clustering of students within schools. Huber-White (sandwich) corrections only deal with heteroscedasticity, not clustering. We extend these ideas to handle clustering next.

50.2 Cluster Robust Standard Errors

The next step is to get standard errors which allow the residuals to be correlated within clusters and to have non-zero means within clusters (which violates the assumption of independence of residuals). The math here is harder to explain. We start by calculating \(X \cdot \varepsilon\), multiplying each row of \(X\) by the associated residual. Then we take the column sums of this weighted matrix within each cluster. This is easiest to understand for the intercept column, where the sum is simply the sum of the residuals in that cluster. If all of the residuals in a cluster are large and positive (or large and negative), this sum will be very large; if the residuals are close to mean 0 in a cluster, the sum will be small. We then bind the results into an \(M\) by \(K\) matrix, where \(M\) is the number of clusters, each row corresponds to a cluster, and each column corresponds to a coefficient; call this matrix \(U\). Its cross-product \(U^TU\) is the meat that we sandwich with \(V\). Finally, we take

\[\sqrt{ diag( \frac{M}{M-1}\frac{N-1}{N-K} VU^TUV)}\]

which gives us estimated standard errors for the regression coefficients.

The intuition isn’t as clear here, but notice that the more highly correlated the residuals are within clusters (especially clusters with extreme values of the predictors), the larger \(U^TU\) will be, and the larger our estimated standard errors will be.

Here’s a “by hand” implementation in R.

cluster <- dat$id
M <- length(unique(cluster))
weight_mat <- as.vector(resids) * X ### each row of X multiplied by its residual
head( weight_mat )
  (Intercept)        ses sector
1   -1.411858  2.1573194      0
2    9.648498 -5.6733165      0
3   10.112584 -5.3394444      0
4   -1.042618  0.6964687      0
5    6.570618 -1.0381576      0
6   -7.275123 -0.1600527      0
u_icept <- tapply(weight_mat[, '(Intercept)'], cluster, sum) ### sum up the weighted 
                                                             ### intercepts in each cluster
u_ses <- tapply(weight_mat[, 'ses'], cluster, sum) ### sum up the weighted slopes in 
                                                       ### each cluster
u_sector <- tapply(weight_mat[, 'sector'], cluster, sum)

u <- cbind(u_icept, u_ses, u_sector)

### cluster-robust standard errors
SE.adj.hand = sqrt((M/(M-1))*((N-1)/(N-K)) * diag(V %*% t(u) %*% u %*% V)) 
SE.adj.hand
(Intercept)         ses      sector 
  0.2031455   0.1279373   0.3171766 

These are a lot higher than before; there’s a lot of within-cluster correlation, and our OLS-based estimated standard errors are unrealistically small.
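For a rough sense of scale, we can compare these to the (non-clustered) heteroscedasticity-robust standard errors in SEs computed earlier; the cluster-robust versions are roughly 1.3 to 2 times as large.

SE.adj.hand / SEs ### cluster-robust vs. non-clustered robust standard errors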

You can use these standard errors in general if you’re not interested in modeling what’s happening at the cluster level and just want to get the right standard errors for your fixed effects.

50.2.1 Using R Packages

There is a package that gives you the cluster-robust estimate of the variance-covariance matrix. You can then use this matrix to get your adjusted standard errors:

library( multiwayvcov )

m1 <- lm( mathach ~ ses + sector, data=dat )
vcov_id <- cluster.vcov(m1, dat$id)
coeftest(m1, vcov_id)

t test of coefficients:

            Estimate Std. Error t value  Pr(>|t|)    
(Intercept) 11.79325    0.20315 58.0532 < 2.2e-16 ***
ses          2.94856    0.12794 23.0469 < 2.2e-16 ***
sector       1.93501    0.31718  6.1007 1.111e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Compare to what we get if we ignore clustering:

coeftest( m1 )  ## BAD!!

t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept) 11.793254   0.106102 111.150 < 2.2e-16 ***
ses          2.948558   0.097831  30.139 < 2.2e-16 ***
sector       1.935013   0.152493  12.689 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can look at how much bigger they are:

SE.adj = sqrt( diag( vcov_id ) )
SE.bad = sqrt( diag( vcov( m1 ) ) )
SE.adj / SE.bad
(Intercept)         ses      sector 
   1.914623    1.307743    2.079937 

The standard errors are nearly double for the intercept and more than double for sector. The increase for ses is smaller, since ses varies within cluster.

Finally, we check to see that our hand-calculation is the same as the package:

SE.adj.hand - SE.adj
  (Intercept)           ses        sector 
 1.296185e-14 -3.025358e-15 -2.997602e-15 

Up to rounding errors, we are the same!

50.2.2 Aside: Making your own function

The following is code to generate the cluster-robust variance-covariance matrix more efficiently. For reference (or to ignore):

cl <- function(dat, fm, cluster){
  attach(dat, warn.conflicts = FALSE)
  require(sandwich)
  require(lmtest)
  M <- length(unique(cluster))    ### number of clusters
  N <- length(cluster)            ### number of observations
  K <- fm$rank                    ### number of estimated coefficients
  dfc <- (M/(M-1))*((N-1)/(N-K))  ### finite-sample (degrees-of-freedom) correction
  uj <- apply(estfun(fm), 2, function(x)
              tapply(x, cluster, sum))  ### cluster-level sums of the score contributions
  vcovCL <- dfc * sandwich(fm, meat. = crossprod(uj)/N)  ### cluster-robust vcov matrix
  coeftest(fm, vcovCL)
}
 
cl(dat, mod, dat$id)

t test of coefficients:

            Estimate Std. Error t value  Pr(>|t|)    
(Intercept) 11.79325    0.20315 58.0532 < 2.2e-16 ***
ses          2.94856    0.12794 23.0469 < 2.2e-16 ***
sector       1.93501    0.31718  6.1007 1.111e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
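As an aside, if you would rather not write your own function, newer versions of the sandwich package also provide vcovCL(), which computes a cluster-robust variance-covariance matrix directly. Here is a minimal sketch; its default finite-sample adjustment may differ slightly from the hand calculation above, so expect the standard errors to be very close to, but not necessarily identical to, SE.adj.hand.

vcov_cl <- vcovCL( m1, cluster = dat$id ) ### sandwich's built-in cluster-robust vcov
coeftest( m1, vcov. = vcov_cl )
sqrt( diag( vcov_cl ) ) ### should be very close to SE.adj.hand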