11  An Introduction to Missing Data

Author

Maxime Rischard and Luke Miratrix

11.1 Introduction

Handling missing data is the icky, unglamorous part of any statistical analysis. It is where the skeletons lie. There’s a range of options available, which are, broadly speaking:

  1. Delete the observations with missing covariates (this is a “complete case analysis”)
  2. Plug in reasonable values for the missing values. This is called “imputation.” We discuss three ways of doing this that are increasingly sophisticated and layered on each other:
  1. Mean imputation. Simply take the mean of all the observations where you know the value, and then use that for anything that is missing.
  2. Regression imputation. Generate regression equations describing how all the variables are connected, and use those to predict any missing value.
  3. Stochastic regression imputation. Use regression imputation, but also add some residual noise to all our imputed values so that our imputed values have as much variation as our actual values (otherwise our imputed values will tend to be all clumped together).
  1. Multiply impute the missing data, by fully modeling the covariate and the missingness, and generating a range of complete datasets under this model. Here you end up with a collection of complete datasets that are all “reasonable guesses” as to what the full dataset might have been. You then analyze each one, and aggregate your findings across them to get a final answer.

The first two general approaches are imperfect, while the third is often more work than the original analysis that we were hoping to perform. For this course, doing a 2a, 2b, or 2c are all reasonable choices. If you have very little missing data you can often get away with #1. We have no expectations that people will take the plunge into #3 (multiple imputation). In real life, people will often analyze their data with a complete case analysis and some other strategy, and then compare the results. In Education, if missingness is below 10% people usually just do mean imputation, but regression imputation would probably be superior.

This handout provides an introduction to missing data, and includes a few commands to explore and deal with missing data. In this document we first talk about exploring missing data (in particular getting plots that show you if you have any notable patterns in how things are missing) and then we give a brief walk-through of the 3 methods listed above.

We will the mice and VIM packages, which you can install using install.packages() if you have not yet done so. These are simple and powerful packages for visualizing and imputing missing data. At the end of this document we also describe the Amelia package.

library(tidyverse)
library(mice)
library(VIM)

Throughout we use a small built-in R dataset on air quality as a working example.

  data(airquality)
  nrow(airquality)
[1] 153
  head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
  summary(airquality[1:4])
     Ozone          Solar.R         Wind            Temp     
 Min.   :  1.0   Min.   :  7   Min.   : 1.70   Min.   :56.0  
 1st Qu.: 18.0   1st Qu.:116   1st Qu.: 7.40   1st Qu.:72.0  
 Median : 31.5   Median :205   Median : 9.70   Median :79.0  
 Mean   : 42.1   Mean   :186   Mean   : 9.96   Mean   :77.9  
 3rd Qu.: 63.2   3rd Qu.:259   3rd Qu.:11.50   3rd Qu.:85.0  
 Max.   :168.0   Max.   :334   Max.   :20.70   Max.   :97.0  
 NA's   :37      NA's   :7                                   

11.2 Visualizing missing data

Just like with anything in statistics, the first thing to do is to look at our data. We want to know which variables are often missing, and if some variables are often missing together. We also want to know how much data is missing. The mice package has a variety of plots to show us patterns of missingness:

  md.pattern(airquality)

    Wind Temp Month Day Solar.R Ozone   
111    1    1     1   1       1     1  0
35     1    1     1   1       1     0  1
5      1    1     1   1       0     1  1
2      1    1     1   1       0     0  2
       0    0     0   0       7    37 44

This plot gives us the different missing data patterns and the number of observations that have each missing data pattern. For example, the second row in the plot says there are 35 observations that have a missing data pattern where only Ozone is missing.

We can also just look at 10 observations to see everything that is going on. Here we take the first 10 rows of our dataset, but could also take a random 10 row with the tidyverse’s sample_n method.

  airqualitysub = airquality[1:10, ]
  airqualitysub
   Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
2     36     118  8.0   72     5   2
3     12     149 12.6   74     5   3
4     18     313 11.5   62     5   4
5     NA      NA 14.3   56     5   5
6     28      NA 14.9   66     5   6
7     23     299  8.6   65     5   7
8     19      99 13.8   59     5   8
9      8      19 20.1   61     5   9
10    NA     194  8.6   69     5  10

We see that we have one observation missing two covariates and one each of missing Ozone only and Solar.R only.

11.2.1 The VIM Package

The VIM package gives some alternate plots to explore missing data patterns. For example, aggr():

 aggr(airquality, col=c('navyblue','red'),
      numbers=TRUE, sortVars=TRUE, labels=names(data),
      cex.axis=.7, gap=3, prop=c(TRUE, FALSE), 
      ylab=c("Histogram of missing data","Pattern"))


 Variables sorted by number of missings: 
 Variable  Count
    Ozone 0.2418
  Solar.R 0.0458
     Wind 0.0000
     Temp 0.0000
    Month 0.0000
      Day 0.0000

On the left, we have the proportion of missing data for each variable in our dataset; We can see that Ozone and Solar.R have missing values, and Ozone is clearly missing at higher rates than Solar.R On the right, we have the joint distribution of missingness. We can see that 111 observations have no missing values. From those with missing values, the majority have missing values for Ozone, some have missing values for Solar.R and only 2 observations have missing values for both Ozone and Solar.R.

  marginplot(airquality[1:2])

Here we have a scatterplot for the first two variables in our dataset: Ozone and Solar.R. These are the variables that have missing data. In addition to the standard scatterplot we are familiar with, information about missingness is shown in the margins. The red dots indicate observations with one or both values missing (so there can be a bunch of dots stacked up in the bottom-left corner). The numbers (37, 7, and 2 tells us how many observations are missing either or both of these variables).

11.3 Complete case analysis

Working with complete cases (dropping observations with any missing data on our outcome and predictors) is always an option. We have been doing this in class and section. However, this can lead to substantial data loss, if we have a lot of missingness and it can heavily bias our results depending on why observations are missing.

Complete case analysis is the R default.

  fit <- lm(Ozone ~ Wind, data = airquality )
  summary(fit)

Call:
lm(formula = Ozone ~ Wind, data = airquality)

Residuals:
   Min     1Q Median     3Q    Max 
-51.57 -18.85  -4.87  15.23  90.00 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    96.87       7.24   13.38  < 2e-16 ***
Wind           -5.55       0.69   -8.04  9.3e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26.5 on 114 degrees of freedom
  (37 observations deleted due to missingness)
Multiple R-squared:  0.362, Adjusted R-squared:  0.356 
F-statistic: 64.6 on 1 and 114 DF,  p-value: 9.27e-13

Note the listing in the summary of number of items deleted. You can find out which rows were deleted:

## which rows/observations were deleted
  deleted <- na.action(fit)
  deleted
  5  10  25  26  27  32  33  34  35  36  37  39  42  43  45  46  52  53  54  55 
  5  10  25  26  27  32  33  34  35  36  37  39  42  43  45  46  52  53  54  55 
 56  57  58  59  60  61  65  72  75  83  84 102 103 107 115 119 150 
 56  57  58  59  60  61  65  72  75  83  84 102 103 107 115 119 150 
attr(,"class")
[1] "omit"
  naprint(deleted)
[1] "37 observations deleted due to missingness"

We have more incomplete rows if we add Solar.R as predictor.

  fit2 <- lm(Ozone ~ Wind+Solar.R, data=airquality)
  naprint(na.action(fit2))
[1] "42 observations deleted due to missingness"

We can also drop observations with missing data ourselves instead of letting R do it for us. Dropping data preemptively is generally a good idea, especially if you plan on using predict().

## complete cases on all variables in the data set 
complete.v1 = filter( airquality, complete.cases(airquality) )
  
## drop observations with missing values, but ignoring a specific
## variable
complete.v2 = filter( airquality, 
                      complete.cases( select( airquality, -Wind) ) )

## drop observations with missing values on a specific variable
complete.v3 = filter(airquality, !is.na(Ozone))

Once you have subset your data, you just analyze what is left as normal. Easy as pie!

11.4 Mean imputation

Instead of dropping observations with missing values, we can plug in some kind of reasonable value for the missing value, e.g. the grand/global mean. While this can be statistically questionable, it does allow us to use the information provided by that unit’s outcome and other covariates, without, we hope, unduly affecting the analysis of the missing covariate.

Generally, people will first plug in the mean value for anything missing, but then also make a dummy variable of whether that observation had a missing value there (or sometimes any missing value). You would then include both the original vector of covariates (with the means plugged in) along with the dummy variable in subsequent regressions and analyses.

11.4.1 Doing Mean Imputation manually

Manually, we can just replace missing values for a variable with the grand/global mean.

## make a new copy of the data
  data.mean.impute = airquality
  
## select the observations with missing Ozone
  miss.ozone = is.na(data.mean.impute$Ozone)
  
## replace those NAs with mean(Ozone)
  data.mean.impute[miss.ozone,"Ozone"] = mean(airquality$Ozone, na.rm=TRUE)

In a multi-level context, it might make more sense to impute using the group mean rather than the grand mean. Here’s a generic function to do it. Here we group by month:

## a function that replaces missing values in a vector 
## by the mean of the other values
  mean.impute = function(y) { 
      y[is.na(y)] = mean(y, na.rm=TRUE)
      return(y)
  }

  data.mean.impute = airquality %>% group_by(Month) %>%
    mutate(Ozone = mean.impute(Ozone),
           Solar.R = mean.impute(Solar.R) ) 

We have mean imputed the Ozone column and the Solar.R column

11.4.2 Mean imputation with the Mice package

We can use the mice package to do mean imputation. The mice package is a package that can do some quite complex imputation, and so when you call mice() (which says “impute missing values please”) you get back a rather complex object telling you what mice imputed, for whom, etc. This object, which is a mids object (see help(mids)), contains the multiply imputed dataset (or in our case, so far, singly imputed). The mice package then provides a lot of nice functions allowing you to get your imputed information out of this object.

We first demonstrate this for the smaller 10 observations dataset we made above. Mice is generally going to be a two-step process: impute data, get completed dataset.

For step 1:

  imp <- mice(airqualitysub, method="mean", m=1, maxit=1)

 iter imp variable
  1   1  Ozone  Solar.R
Warning: Number of logged events: 1
  imp
Class: mids
Number of multiple imputations:  1 
Imputation methods:
  Ozone Solar.R    Wind    Temp   Month     Day 
 "mean"  "mean"      ""      ""      ""      "" 
PredictorMatrix:
        Ozone Solar.R Wind Temp Month Day
Ozone       0       1    1    1     0   1
Solar.R     1       0    1    1     0   1
Wind        1       1    0    1     0   1
Temp        1       1    1    0     0   1
Month       1       1    1    1     0   1
Day         1       1    1    1     0   0
Number of logged events:  1 
  it im dep     meth   out
1  0  0     constant Month

For step 2:

  cmp = complete(imp)
  cmp
   Ozone Solar.R Wind Temp Month Day
1   41.0     190  7.4   67     5   1
2   36.0     118  8.0   72     5   2
3   12.0     149 12.6   74     5   3
4   18.0     313 11.5   62     5   4
5   23.1     173 14.3   56     5   5
6   28.0     173 14.9   66     5   6
7   23.0     299  8.6   65     5   7
8   19.0      99 13.8   59     5   8
9    8.0      19 20.1   61     5   9
10  23.1     194  8.6   69     5  10

We see there are no missing values in cmp. They were all imputed with the mean of the other non-missing values. This is mean imputation.

Now let’s impute the full dataset.

  imp <- mice(airquality, method="mean", m=1, maxit=1)

 iter imp variable
  1   1  Ozone  Solar.R
  cmp = complete( imp )

We next make a dummy variable for each row of our data noting whether anything was imputed or not. We use the ici (Incomplete Case Indication) function to list all rows with any missing values.

  head( ici(airquality) )
[1] FALSE FALSE FALSE FALSE  TRUE  TRUE

Note how we have a TRUE or FALSE for each row of our data.

We then store our ici values as a covariate in our completed dataset:

  cmp$imputed = ici(airquality)
  head( cmp )
  Ozone Solar.R Wind Temp Month Day imputed
1  41.0     190  7.4   67     5   1   FALSE
2  36.0     118  8.0   72     5   2   FALSE
3  12.0     149 12.6   74     5   3   FALSE
4  18.0     313 11.5   62     5   4   FALSE
5  42.1     186 14.3   56     5   5    TRUE
6  28.0     186 14.9   66     5   6    TRUE

11.4.2.1 How well did mean imputation work?

Mean imputation has problems. The imputed values will all be the same, and thus when we look at how much variation is in our variables after imputation, it will go down. Compare the SD of our completed dataset Ozone values to the SD of the Ozone values for our non-missing values.

  sd( airquality$Ozone, na.rm=TRUE )
[1] 33
  sd( cmp$Ozone )
[1] 28.7

Next, let’s look at some plots of our completed data, coloring the points by whether they were imputed.

library(ggplot2)
ggplot( cmp, aes(x=Ozone, col=imputed) ) +
    stat_bin( geom="step", position="identity",
              breaks=seq(-20, 200, 10) )

ggplot( cmp, aes(y=Ozone, x=Solar.R, col=imputed) ) +
    geom_point() +
    labs( y="Ozone (ppb)", x="Solar Radiation (lang)" )

What we see in the above plots is that our imputed observations do not look like the rest of our data because one (or both) of their values always is in the exact center. This creates the “+” shape. It also gives the big spike at the mean for the histogram.

11.4.2.2 Important Aside: Namespaces and function collisions

We now need to discuss a sad aspect of R. The short story is, different packages have functions with the same names and so if you have both packages loaded you will need to specify which package to use when calling such a function. You can do this by giving the “surname” of the function at the beginning of the function call. Namespace collision comes up here because the method complete() exists both in the tidyverse and in mice. In tidyverse, complete() fills in rows of missing combinations of values. In mice, complete() gives us a completed dataset after we have made an imputation call.

It turns out that since we loaded tidyverse first and mice second, the mice’s complete() method is the default. But if we loaded the packages in the other order, we would get strange errors. To be clear, we can tell R to use mice’s complete() by writing

  cmp = mice::complete( imp )

In general, you can detect such “namespace collisions” by noticing weird error messages all of a sudden when you don’t expect them. You can then type help( complete ) in the console and the help index page will list all the different completes that have been loaded.

Also when you load a package it will write down what functions are getting mixed up for you. If you were looking at your R console output you should see something like this:

tidyr::complete() masks mice::complete()

11.5 Regression imputation

Regression imputation is half way between mean imputation and multiple imputation. In regression imputation we predict what values we expect for anything missing based on the other values of the observation. For example, if, in the nonmissing part of our data, urban/rural is correlated with race, we might impute a different value for race if we know an observation came from an urban environment vs. rural. We do this with regression: we fit a model predicting each variable using the others and then use that regression model to predict any missing values.

We can do this manually, but then it gets very hard when multiple variables are missing for a given observation. The mice package is more clever: it does variables one at a time, and the cycles around so everything can get imputed using the completed data from earlier iterations.

11.5.1 Manually

Here is how to use other variables to predict missing values.

  ic( airqualitysub )
   Ozone Solar.R Wind Temp Month Day
5     NA      NA 14.3   56     5   5
6     28      NA 14.9   66     5   6
10    NA     194  8.6   69     5  10
  fit <- lm(Ozone ~ Solar.R, data=airqualitysub)

## predict for missing ozone  
  need.pred = subset( airqualitysub, is.na( Ozone ) )
  need.pred
   Ozone Solar.R Wind Temp Month Day
5     NA      NA 14.3   56     5   5
10    NA     194  8.6   69     5  10
  pred <- predict(fit, newdata=need.pred)
  pred
   5   10 
  NA 23.1 

But now we have to merge our imputed data back in, and we still haven’t solved for observation 5 because we are missing the variable we would use to predict the other missing variable! Ick. Multiple missing values on multiple variables is where missing data gets really hard. So let’s quit now and turn to a package that will handle all of this for us.

11.5.2 Mice

To do regression imputation using mice, we simply call the mice() method:

  imp <- mice(airquality[,1:2], method="norm.predict",
              m=1, maxit=3 )

 iter imp variable
  1   1  Ozone  Solar.R
  2   1  Ozone  Solar.R
  3   1  Ozone  Solar.R

The m=1 means “make 1 imputed dataset.” The maxit=3 means “Do a maximum of 3 cycles of imputation (so each variable could be imputed up to three times).

Now imp is just like with mean imputation and so we have everything! How did it do it? By chaining equations. First mice starts with mean imputation. This gives an initial complete dataset. Then mice uses a prediction model to re-predict for the missing values for one covariate. We now have a new and improved (and still complete) dataset. We use this new dataset to run a prediction model on the next covariate, and so forth. We cycle around a few times and then everything is jointly predicting everything else.

The complete() method gives us the final complete dataset with everything imputed. Like so:

  cdat = mice::complete( imp )
  head( cdat )
  Ozone Solar.R
1  41.0     190
2  36.0     118
3  12.0     149
4  18.0     313
5  42.6     186
6  28.0     169
  nrow( cdat )
[1] 153
  nrow( airquality )
[1] 153

Next we make a variable of which cases have imputed values and not (any row with missing data must have been partially imputed.)

  cdat$imputed = ici( airquality )

And see our results! Compare to mean imputation, above.

ggplot( cdat, aes(x=Ozone, col=imputed) ) +
    stat_bin( geom="step", position="identity",
              breaks=seq(-20, 200, 10) )

ggplot( cdat, aes(y=Ozone, x=Solar.R, col=imputed) ) +
    geom_point() +
    labs( y="Ozone (ppb)", x="Solar Radiation (lang)" )

Our regression imputation is better than mean imputation. See how we impute different Ozone for different Solar Radiation values, taking advantage of the information of knowing that they are correlated? But it still is obvious what is mean imputed and what is not. Also, the variance of our imputed values still does not contain the residual variation around the predicted values that we would get in real data. We can do one more enhancement to fix this.

11.5.3 Stochastic regression imputation

We extend regression imputation by randomly drawing observations that look like real ones. The easiest way to think of this is we first predict our missing value, and then we add some residual noise to account for the fact that any covariate value would not be perfectly predicted by the other values.

To illustrate, we first do it on our mini-dataset and look at the imputed values for our observations with missing values only:

imp <- mice( airqualitysub[,1:2], 
             method="norm.nob",
             m=1, maxit=1, seed=1234)

 iter imp variable
  1   1  Ozone  Solar.R
imp$imp
$Ozone
       1
5  36.67
10 -7.32

$Solar.R
    1
5 235
6 238
imp <- mice( airqualitysub[,1:2],
             method="norm.nob", 
             m=1, maxit=1, seed=4234)

 iter imp variable
  1   1  Ozone  Solar.R
imp$imp
$Ozone
      1
5  30.8
10 57.6

$Solar.R
    1
5 277
6 498

Stochastic imputation is random. Because of this, we now have a new argument to mice(): the seed = 1234 sets the random seed so that we can get the same results if we run this code again. We set two different seeds to get two different randomly generated imputations. Note in the two imputations above how we get slightly different values for our imputed data.

Now let’s do it on the full data and look at the imputed values and compare to our plots above.

  imp <- mice(airquality[,1:2], method="norm.nob",
              m=1,maxit=1,seed=13434)

 iter imp variable
  1   1  Ozone  Solar.R
  cdat = mice::complete( imp )
  cdat$imputed = ici( airquality )

  ggplot( cdat, aes(x=Ozone, col=imputed) ) +
    stat_bin( geom="step", position="identity",
              breaks=seq(-20, 200, 10) )

  ggplot( cdat, aes(y=Ozone, x=Solar.R, col=imputed) ) +
    geom_point() +
    labs( y="Ozone (ppb)", x="Solar Radiation (lang)" )

Better, but not perfect. What is better? What is still not perfect?

First aside: All variables you’ll be using for your final analysis model should be included in the imputation model. Notice we included the full dataset in mice, not just the variables with missing values. This way we can account for associations between all the outcome and the predictors in the model we’ll be fitting. Your imputation model can even be more complicated than your model of interest. That is, you can include additional variables that predict missing values but will not be part of your final model of interest.

Second aside: All variables in your imputation model should be in the correct functional form! Quadratic, higher order polynomials and interaction terms are just another variable that we would need to impute. Although it may seem logical to impute your variables first and then calculate the interaction or non-linear term later, this can lead to bias.

Third aside: The ordering of the variables in the dataset you are feeding into mice can make a difference in results and model convergence. Generally, you want to order your variables from least missing to most missing. Here, we reorder the variables from least to most missing, and obtain different results.

  test = airqualitysub[,c(6,5,4,3,2,1)]
  head(test)
  Day Month Temp Wind Solar.R Ozone
1   1     5   67  7.4     190    41
2   2     5   72  8.0     118    36
3   3     5   74 12.6     149    12
4   4     5   62 11.5     313    18
5   5     5   56 14.3      NA    NA
6   6     5   66 14.9      NA    28
  test.imp <- mice(test, seed=2, m=1, print=FALSE)
Warning: Number of logged events: 1

11.6 Multiple imputation

If missing data is a significant issue in your dataset, then mean or regression imputation can be a bit too hacky and approximate. In these contexts, multiple imputation is the way to go.

We do this as follows:

  imp <- mice(airqualitysub, seed=2, print=FALSE)
Warning: Number of logged events: 1
  imp
Class: mids
Number of multiple imputations:  5 
Imputation methods:
  Ozone Solar.R    Wind    Temp   Month     Day 
  "pmm"   "pmm"      ""      ""      ""      "" 
PredictorMatrix:
        Ozone Solar.R Wind Temp Month Day
Ozone       0       1    1    1     0   1
Solar.R     1       0    1    1     0   1
Wind        1       1    0    1     0   1
Temp        1       1    1    0     0   1
Month       1       1    1    1     0   1
Day         1       1    1    1     0   0
Number of logged events:  1 
  it im dep     meth   out
1  0  0     constant Month

We can look at our imputed solar values:

imp$imp$Solar.R
    1  2   3  4  5
5 149 99 194 99 19
6 194 19 194 19 19

See multiple columns of imputed data? (We have 5 columns here, one for each imputation, with each row a single observation we are imputing for.)

How to get each complete dataset?

## first complete dataset 
mice::complete(imp, 1)
   Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
2     36     118  8.0   72     5   2
3     12     149 12.6   74     5   3
4     18     313 11.5   62     5   4
5     18     149 14.3   56     5   5
6     28     194 14.9   66     5   6
7     23     299  8.6   65     5   7
8     19      99 13.8   59     5   8
9      8      19 20.1   61     5   9
10    36     194  8.6   69     5  10
## and our second complete dataset
mice::complete(imp, 2)
   Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
2     36     118  8.0   72     5   2
3     12     149 12.6   74     5   3
4     18     313 11.5   62     5   4
5     41      99 14.3   56     5   5
6     28      19 14.9   66     5   6
7     23     299  8.6   65     5   7
8     19      99 13.8   59     5   8
9      8      19 20.1   61     5   9
10    18     194  8.6   69     5  10

See how the two datasets are are different? They were randomly imputed. We basically used the stochastic regression thing, above, multiple times.

  mice::complete(imp, 1)[ ici(airqualitysub), ]
   Ozone Solar.R Wind Temp Month Day
5     18     149 14.3   56     5   5
6     28     194 14.9   66     5   6
10    36     194  8.6   69     5  10
  mice::complete(imp, 2)[ ici(airqualitysub), ]
   Ozone Solar.R Wind Temp Month Day
5     41      99 14.3   56     5   5
6     28      19 14.9   66     5   6
10    18     194  8.6   69     5  10

On full data:

  imp <- mice(airquality, seed=1, print=FALSE)

Once you have imputed, you can estimate for each imputed dataset using the with() method that, in mice, will do the regression or other specified analysis for each completed dataset. See help with.mids for more information.

  fit <- with(imp, lm(Ozone ~ Wind + Temp + Solar.R))
  class(fit)
[1] "mira"

We get back a mira object, which is essentially a list of results, one for each imputed dataset. The with() method can take any function call that takes a formula. So glm, lm, whatever…

The next step is to pool the estimates using the standard theory of combining multiply imputed datasets. The basic idea is to combine the variation/uncertainty of the multiple sets with the average uncertainty we would have for each set, if it were truly complete and not imputed.

Pooling looks like this:

pooled <- pool(fit)
pooled
Class: mipo    m = 5 
         term m estimate     ubar        b        t dfcom   df   riv lambda
1 (Intercept) 5 -63.1480 3.66e+02 2.88e+02 7.12e+02   149 13.8 0.945  0.486
2        Wind 5  -3.0132 3.02e-01 1.37e-01 4.67e-01   149 24.0 0.546  0.353
3        Temp 5   1.5974 4.52e-02 2.50e-02 7.52e-02   149 19.6 0.663  0.399
4     Solar.R 5   0.0566 3.92e-04 8.44e-05 4.94e-04   149 52.4 0.258  0.205
    fmi
1 0.547
2 0.401
3 0.452
4 0.234

Parameter estimates generated by pool are the average of the parameter estimates from the model fit on each imputed dataset separately. The SEs are calculated by averaging the SEs and inflating them a bit based on how much the individual estimates changed–this accounts for the uncertainty from having missing values that we can only guess at, rather than see directly.

We can summarize our analysis to get p-values like this:

  tab <- summary(pooled)
  knitr::kable( tab )
term estimate std.error statistic df p.value
(Intercept) -63.148 26.677 -2.37 13.8 0.033
Wind -3.013 0.683 -4.41 24.0 0.000
Temp 1.597 0.274 5.83 19.6 0.000
Solar.R 0.057 0.022 2.55 52.4 0.014

We can compare to our old, sad method of complete case analysis:

  fit <- lm(Ozone~Wind+Temp+Solar.R,
            data=airquality,na.action=na.omit)
  summary( fit )

Call:
lm(formula = Ozone ~ Wind + Temp + Solar.R, data = airquality, 
    na.action = na.omit)

Residuals:
   Min     1Q Median     3Q    Max 
-40.48 -14.22  -3.55  10.10  95.62 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -64.3421    23.0547   -2.79   0.0062 ** 
Wind         -3.3336     0.6544   -5.09  1.5e-06 ***
Temp          1.6521     0.2535    6.52  2.4e-09 ***
Solar.R       0.0598     0.0232    2.58   0.0112 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.2 on 107 degrees of freedom
  (42 observations deleted due to missingness)
Multiple R-squared:  0.606, Adjusted R-squared:  0.595 
F-statistic: 54.8 on 3 and 107 DF,  p-value: <2e-16

In this case, the missing data estimates are basically the same as the complete case analysis. We only had 5% missing data though; generally, with small amounts of missing data, nothing changes anything much.

Aside: You will notice that once we fit our model on the imputed data, with() returned an object of class mira. Mira objects can be pooled to get the pooled estimates, whereas objects of class glm, lm, lmer, etc. cannot be pooled.

You unfortunately cannot use predict() with a mira object. To use predict, one trick is to first stack the imputed datasets into one mega dataset, and then fit your model on this complete dataset.

11.7 Extensions

11.7.1 Non-continuous variables

Everything shown above can easily be extended to non-continuous variables. The easiest way to do this is using the mice package. It allows you to specify the type of variable you are imputing, e.g. dichotomous or categorical. Mice will automatically detect and handle non-continuous variables. You can also specify these variables yourself. Here is an example using nhanes data (another built-in R dataset).

## load data 
  data(nhanes2)
  head(nhanes2)
    age  bmi  hyp chl
1 20-39   NA <NA>  NA
2 40-59 22.7   no 187
3 20-39   NA   no 187
4 60-99   NA <NA>  NA
5 20-39 20.4   no 113
6 60-99   NA <NA> 184
## create some missing values for an ordered categorical variable
  nhanes2$age[1:5] = NA
  head(nhanes2) 
    age  bmi  hyp chl
1  <NA>   NA <NA>  NA
2  <NA> 22.7   no 187
3  <NA>   NA   no 187
4  <NA>   NA <NA>  NA
5  <NA> 20.4   no 113
6 60-99   NA <NA> 184
## impute 5 datasets 
  imp.cat <- mice(nhanes2, m = 5, print=FALSE)     
  full.cat = mice::complete(imp.cat)        
  head(full.cat)
    age  bmi hyp chl
1 40-59 21.7 yes 187
2 40-59 22.7  no 187
3 20-39 27.5  no 187
4 60-99 24.9 yes 218
5 40-59 20.4  no 113
6 60-99 24.9 yes 184

We can check what imputation method mice used for each variable:

  imp.cat$method
      age       bmi       hyp       chl 
"polyreg"     "pmm"  "logreg"     "pmm" 

We can see that mice used the polyreg imputation method for the variable age, which means it treated it as an unordered categorical variable. But we know our age variable is an ordered variable: higher values categories signified older age. We can manually force mice to treat age as an ordered as so:

  imp.cat2 <- mice(nhanes2, 
                   meth=c("polr","pmm","logreg","pmm"),
                   m=5, print=FALSE)
  head(mice::complete(imp.cat2, 1))
    age  bmi hyp chl
1 40-59 27.5 yes 184
2 60-99 22.7  no 187
3 60-99 20.4  no 187
4 20-39 35.3  no 184
5 40-59 20.4  no 113
6 60-99 22.7  no 184
  imp.cat2$method
     age      bmi      hyp      chl 
  "polr"    "pmm" "logreg"    "pmm" 

11.7.2 Multi-level data

Multilevel data gets more tricky: should we impute taking into account cluster? How do we do that?

For an initial pass, I recommend simply doing regression imputation ignoring cluster/grouping, and then adding in a dummy variable of whether a value is imputed.

11.7.3 Longitudinal data

With longitudinal data we can often use all our data even for individuals with missing data on the outcome, if we assume data are MAR (“Missing at Random”). MAR means that conditional on the observed data, missingness may depend on any observed data, but not on unobserved data. we explore our missing data on individuals over time and on outcomes as above to get a sense of whether MAR is a reasonable assumption or not. Then lmer basically handles the rest for us, as far as we have enough observations per individual, on average, to estimate the number of random effects we are trying to estimate. With respect to missing data on covariates or predictors, you can handle those with one of the methods described above.

Here we show how to explore missing data in longitudinal analysis using data on toenail detachment, which you will see in the unit on generalized MLMs. The data is from a RCT where patients were getting a different type of drug to prevent toenail detachment (the outcome).

## load data
  toes = foreign::read.dta( "data/toenail.dta" )

First, let’s look at how many times patients were observed.

## how many time points per patient?
  table( table( toes$patient ) )

  1   2   3   4   5   6   7 
  5   3   7   6  10  39 224 

We have 224 patients observed at all 7 time points, and the rest of the patients are observed at fewer time points, between 1 and 6.

## define function 
summarise.patient = function( patient ) {
    pat = rep( ".", 7 )
    pat[ patient$visit ] = 'X'
    paste( pat, collapse="" )
}
  ## For each patient, this code makes a string of "." 
  ## then it replaces all dots with an "X" if we have data for that visit

## summarize missingness  
miss = toes %>% group_by( patient ) %>%
    do( pattern = summarise.patient(.) ) %>%
    unnest(cols = c(pattern))
  ## Group the data by patient 
  ## Then use the do() command on each chunk of our dataframe
  ## The "." means "the chunk" (it is a pronoun, essentially).  
  ## This code creates a list of character vectors
  ## The unnest() takes our character vector out of this list made by "do"

head( miss )
# A tibble: 6 × 2
  patient pattern
    <dbl> <chr>  
1       1 XXXXXXX
2       2 XXXXXX.
3       3 XXXXXXX
4       4 XXXXXXX
5       6 XXXXXXX
6       7 XXXXXXX

Here we see the different patterns of missing outcomes, i.e., when patients leave and if they come back. When patients leave and never come back, regardless of the time point (see lines 4 and 5), we have “monotone missingness.”

## sort missing patterns in decreasing order
## starting with no missingness 
sort( table( miss$pattern ), decreasing=TRUE )

XXXXXXX XXXXX.X XXXX.XX XXX.... X...... XXXXX.. XXXX... XX..... XXX.XXX XXXXXX. 
    224      21      10       6       5       5       4       3       3       3 
XXX.X.. XXXX..X X.XXXXX XX..X.. XX.XXX. XX.XXXX XXX..XX XXX.X.X 
      2       2       1       1       1       1       1       1 
## summarize number of data patterns 
miss = miss %>% group_by( pattern ) %>%
    summarise( n=n() )
miss = arrange( miss, -n )
miss
# A tibble: 18 × 2
   pattern     n
   <chr>   <int>
 1 XXXXXXX   224
 2 XXXXX.X    21
 3 XXXX.XX    10
 4 XXX....     6
 5 X......     5
 6 XXXXX..     5
 7 XXXX...     4
 8 XX.....     3
 9 XXX.XXX     3
10 XXXXXX.     3
11 XXX.X..     2
12 XXXX..X     2
13 X.XXXXX     1
14 XX..X..     1
15 XX.XXX.     1
16 XX.XXXX     1
17 XXX..XX     1
18 XXX.X.X     1
## percent missing data (224 complete cases)
224 / sum( miss$n )
[1] 0.762
  ## 76% of patients with complete data

We can look at patterns of missing outcomes by reshaping to wide, to have one column per time point, and then looking at the missing data patterns with md.pattern() from the mice package:

## reshape data to wide 
  dat.wide = reshape( toes2, direction="wide", v.names="outcome",
                    idvar="patient", timevar = "visit" )

## looking at missing data with mice package
  pat <- md.pattern( dat.wide )

Another way to generate missingness patterns is to create a function. The following function takes the visits and outcomes and puts a 1 or 0 if there is an outcome and a dot if missing:

make.pat = function( visit, outcome ) {
    pat = rep( ".", 7 )
    pat[ visit ] = outcome
    paste( pat, collapse="" )
}

## call our function on all our patients.
outcomes = toes %>% group_by( patient ) %>%
    summarise( tx = Tx[[1]],
               num.obs = n(),
               num.detach = sum( outcome ),
               out = make.pat( visit, outcome ) )

head( outcomes, 20 )
# A tibble: 20 × 5
   patient tx           num.obs num.detach out    
     <dbl> <fct>          <int>      <dbl> <chr>  
 1       1 Itraconazole       7          3 1110000
 2       2 Terbinafine        6          2 001100.
 3       3 Terbinafine        7          1 0000001
 4       4 Terbinafine        7          1 1000000
 5       6 Itraconazole       7          3 1110000
 6       7 Itraconazole       7          3 0000111
 7       9 Itraconazole       7          0 0000000
 8      10 Terbinafine        7          0 0000000
 9      11 Itraconazole       7          4 1111000
10      12 Terbinafine        7          3 0100110
11      13 Terbinafine        7          4 1111000
12      15 Itraconazole       6          2 11000.0
13      16 Itraconazole       6          1 0000.10
14      17 Terbinafine        6          4 11110.0
15      18 Terbinafine        6          1 001.000
16      19 Itraconazole       7          0 0000000
17      20 Terbinafine        6          0 000.000
18      21 Itraconazole       3          2 110....
19      22 Terbinafine        7          3 1110000
20      23 Itraconazole       7          0 0000000

11.8 Further reading

Some further reading on handling missing data. But this is really a course into itself.

  • Gelman & Hill Chapter 25 has a more detailed discussion of missing data imputation.

  • White IR, Royston P, Wood AM. Multiple imputation using chained equations: issues and guidance for practice. Statistics in Medicine 2011;30: 377-399.

  • Graham, JW, Olchowski, AE, Gilreath, TD, 2007. How Many Imputations are Really Needed? Some Practical Clarifications of Multiple Imputation Theory 206–213. https://doi.org/10.1007/s11121-007-0070-9

  • van Buurin S, Groothuis-Oudshoorn K, MICE: Multivariate Imputation by Chained Equations. Journal of Statistical Software. 2011;45(3):1-68.

  • Grund S, Lüdtke O, Robitzsch A. Multiple Imputation of Missing Data for Multilevel Models: Simulations and Recommendations. DOI: 10.1177/1094428117703686

11.9 Appendix: More about the mice package

The mice package gives back a very complex object that has a lot of information about how values were imputed, which values were imputed, and so forth. In the following we unpack the imp variable from above a bit more.

Looking at the imputation object

In the following code, we look at the object we get back from mice(). It has lots of parts that we can peek into.

First, the imp list inside of imp stores all of our newly imputed data. It is itself a list of each variable with their imputed values:

  imp$imp
$Ozone
      1   2   3   4   5
5     6  19  14   8  14
10   12  12   7  23  23
25   14  19  14  19  14
26   37  18  32  32  18
27   11   1  18  13  18
32   65  45  13  28  29
33   22  36  12  18  16
34   13  18   1  13  13
35   63  35  45  52  71
36   23  39  20  59  96
37   24  16  12  34  18
39   64 135  85  80  91
42  115  76 115  37  91
43   66 122  78  64 122
45   44  28  45  23  16
46   23  45  46  45  35
52   20  52  63  47  47
53   59  59  48 115  37
54   40  16  35  37  63
55   40  35  48  39  49
56   23  39  59  59  16
57   44  52  40  52  20
58   30  30  27  14  23
59   45  32  16  16  46
60   44  27  34  28  30
61   89  64  80  37  64
65   16  16  14  23  29
72   46  52  65  45  35
75   35  64  71  18  78
83   20  40  71  46  59
84   28  63  37  29  63
102 115  78  78  37  66
103  46  29  31  23  40
107  16  30  13  14  22
115  41  12  44   7  22
119  50  78 122  85  50
150  24  12  27  21  12

$Solar.R
     1   2   3   4   5
5    7 313  82  13 314
6  322 187 222  24 238
11  66 274 139 135 112
27  20  24   7 238 193
96 175 223 284 197 220
97  51 139 274 237  98
98  98 203 220 188 276

$Wind
[1] 1 2 3 4 5
<0 rows> (or 0-length row.names)

$Temp
[1] 1 2 3 4 5
<0 rows> (or 0-length row.names)

$Month
[1] 1 2 3 4 5
<0 rows> (or 0-length row.names)

$Day
[1] 1 2 3 4 5
<0 rows> (or 0-length row.names)
  str( imp$imp )
List of 6
 $ Ozone  :'data.frame':    37 obs. of  5 variables:
  ..$ 1: int [1:37] 6 12 14 37 11 65 22 13 63 23 ...
  ..$ 2: int [1:37] 19 12 19 18 1 45 36 18 35 39 ...
  ..$ 3: int [1:37] 14 7 14 32 18 13 12 1 45 20 ...
  ..$ 4: int [1:37] 8 23 19 32 13 28 18 13 52 59 ...
  ..$ 5: int [1:37] 14 23 14 18 18 29 16 13 71 96 ...
 $ Solar.R:'data.frame':    7 obs. of  5 variables:
  ..$ 1: int [1:7] 7 322 66 20 175 51 98
  ..$ 2: int [1:7] 313 187 274 24 223 139 203
  ..$ 3: int [1:7] 82 222 139 7 284 274 220
  ..$ 4: int [1:7] 13 24 135 238 197 237 188
  ..$ 5: int [1:7] 314 238 112 193 220 98 276
 $ Wind   :'data.frame':    0 obs. of  5 variables:
  ..$ 1: logi(0) 
  ..$ 2: logi(0) 
  ..$ 3: logi(0) 
  ..$ 4: logi(0) 
  ..$ 5: logi(0) 
 $ Temp   :'data.frame':    0 obs. of  5 variables:
  ..$ 1: logi(0) 
  ..$ 2: logi(0) 
  ..$ 3: logi(0) 
  ..$ 4: logi(0) 
  ..$ 5: logi(0) 
 $ Month  :'data.frame':    0 obs. of  5 variables:
  ..$ 1: logi(0) 
  ..$ 2: logi(0) 
  ..$ 3: logi(0) 
  ..$ 4: logi(0) 
  ..$ 5: logi(0) 
 $ Day    :'data.frame':    0 obs. of  5 variables:
  ..$ 1: logi(0) 
  ..$ 2: logi(0) 
  ..$ 3: logi(0) 
  ..$ 4: logi(0) 
  ..$ 5: logi(0) 
  str( imp$imp$Ozone )
'data.frame':   37 obs. of  5 variables:
 $ 1: int  6 12 14 37 11 65 22 13 63 23 ...
 $ 2: int  19 12 19 18 1 45 36 18 35 39 ...
 $ 3: int  14 7 14 32 18 13 12 1 45 20 ...
 $ 4: int  8 23 19 32 13 28 18 13 52 59 ...
 $ 5: int  14 23 14 18 18 29 16 13 71 96 ...

We see that Ozone and Solar.R have imputed values, and the other variables do not.

Next, we see two missing observations in our original data and then see the two imputed values for these two missing observations.

  airqualitysub$Ozone
 [1] 41 36 12 18 NA 28 23 19  8 NA
  imp$imp$Ozone[,1]
 [1]   6  12  14  37  11  65  22  13  63  23  24  64 115  66  44  23  20  59  40
[20]  40  23  44  30  45  44  89  16  46  35  20  28 115  46  16  41  50  24

We can make (the hard way) a vector of Ozone by plugging our missing values into the original data. But the complete() method, above, is preferred.

  oz = airqualitysub$Ozone
  oz[ is.na( oz ) ] = imp$imp$Ozone[,1]
Warning in oz[is.na(oz)] = imp$imp$Ozone[, 1]: number of items to replace is
not a multiple of replacement length
  oz
 [1] 41 36 12 18  6 28 23 19  8 12

What else is there in imp?

  names(imp)
 [1] "data"            "imp"             "m"               "where"          
 [5] "blocks"          "call"            "nmis"            "method"         
 [9] "predictorMatrix" "visitSequence"   "formulas"        "calltype"       
[13] "post"            "blots"           "ignore"          "seed"           
[17] "iteration"       "lastSeedValue"   "chainMean"       "chainVar"       
[21] "loggedEvents"    "version"         "date"           

What was our imputation method?

  imp$method
  Ozone Solar.R    Wind    Temp   Month     Day 
  "pmm"   "pmm"      ""      ""      ""      "" 

Mean imputation for each variable with missing values. Later this will say other thing.

What was used to impute what?

  imp$predictorMatrix
        Ozone Solar.R Wind Temp Month Day
Ozone       0       1    1    1     1   1
Solar.R     1       0    1    1     1   1
Wind        1       1    0    1     1   1
Temp        1       1    1    0     1   1
Month       1       1    1    1     0   1
Day         1       1    1    1     1   0

11.10 Appendix: The amelia package

Amelia is another multiple imputation and missing data package. We do not prefer it, but have some demonstration code in the following, for reference.

  library(Amelia)

For missingness we can make the following:

  missmap(airquality)

Each row of the plot is a row of the data, and missing values are shown in brown. But ugly! And hard to see any trends in the missingness.

You can use the Amelia package to do mean imputation.

  library(dplyr)

## exclude variables that do not vary
  a.airquality = airquality %>% dplyr::select(-Month)

## impute data
  a.imp <- amelia(a.airquality, m=5)
-- Imputation 1 --

  1  2  3  4  5  6

-- Imputation 2 --

  1  2  3  4  5  6

-- Imputation 3 --

  1  2  3  4  5

-- Imputation 4 --

  1  2  3  4  5  6  7

-- Imputation 5 --

  1  2  3  4  5  6
  a.imp

Amelia output with 5 imputed datasets.
Return code:  1 
Message:  Normal EM convergence. 

Chain Lengths:
--------------
Imputation 1:  6
Imputation 2:  6
Imputation 3:  5
Imputation 4:  7
Imputation 5:  6

We can plot our imputed values against our observed values to check that they make sense. We will do this for just one of five datasets we just imputed using Amelia.

## put imputed values from the third dataset in an object
  one_imp <- a.imp$imputations[[3]]$Ozone

## make object with observed values 
## from observations without missing Ozone values
  obs_data <- a.airquality$Ozone 
  
## make a plot overlaying observed and imputed values
  hist(one_imp[is.na(obs_data)], prob=TRUE, xlab="Ozone",
       main="Histogram of Imputed Values in 3rd Imputation \nCompared to Density in Observed Data",
       col="cyan", ylim=c(0,0.02))
  lines(density(obs_data[!is.na(obs_data)]), col="darkblue", lwd=2)

You can also do multiple imputation in Amelia. However, Amelia does not have an easy way to combine the estimates from the imputed datasets (no analogue of with() in mice). You can write a function that fits your model of interest in each imputed dataset and then use a package like mitools to pool the estimates and variances.

Much easier to use mice!

Aside: A more important limitation of Amelia is that the algorithm it uses to impute missing values assumes multivariate normality, which is often questionable, especially when you have binary variables.