40  Example of a three-level longitudinal model

Author

Luke Miratrix

This will illustrate fitting a three-level model (where we have variation across time within children, and children nested in clusters) and extracting the various components from it. This is a rough document, but hopefully it will be useful. This example is based on a dataset used in Rabe-Hesketh and Skrondal, chapter 8.10, but you don't really need that text.

40.1 Load the data

We first load the data. Shoving a lot of things under the rug, we have five measurements taken over time on each of a collection of kids in Kenya. We are interested in the impact of improved nutrition. The children are clustered in schools, which gives a three-level structure: time within child within school. The schools were treated with different nutrition programs.

In the following we load the data and look at the first few lines. Lots of variables! The main ones are id (the identifier of the kid), treatment (the kind of treatment given to the school), schoolid (the identifier of the school), gender (the gender of the kid), and rn (the time variable). Our outcome is ravens (Raven's colored progressive matrices assessment).

library( foreign )   # read.dta() for Stata files
library( tidyverse ) # ggplot2, dplyr, etc., used throughout
kenya = read.dta( "data/kenya.dta" )

# look at first 9 variables
head( kenya[1:9], 3 )
  id schoolid rn relyear ravens arithmetic vmeaning dstotal age_at_time0
1  1        2  1   -0.15     15          5       25       6         7.19
2  1        2  2    0.14     19          7       39       8         7.19
3  1        2  3    0.46     21          7       33       7         7.19
# what times do we have?
table( kenya$rn ) #time

  1   2   3   4   5 
546 546 546 546 546 
length( unique( kenya$id ) )
[1] 546
length( unique( kenya$schoolid) )
[1] 12

We see we have 546 kids and 12 schools.
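
As a quick sanity check on the nesting, we can count how many kids each school contributes; a small base-R sketch (each kid should appear in exactly one school):

# take one row per kid, then tabulate kids by school
table( kenya$schoolid[ !duplicated( kenya$id ) ] )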

40.2 Plot the data

We can plot the individual trajectories, faceted by gender:

ggplot( data=kenya, aes( x=rn, y=ravens, group=id )  )+ 
            facet_wrap( ~ gender ) + 
            geom_line( alpha=0.3 )
Warning: Removed 114 rows containing missing values or values outside the scale range
(`geom_line()`).

or, faceting by school instead:

ggplot( data=kenya, aes( x=rn, y=ravens, group=id )  )+ 
            facet_wrap( ~ schoolid ) + 
            geom_line( alpha=0.3 )
Warning: Removed 114 rows containing missing values or values outside the scale range
(`geom_line()`).

We can also pull out a random sample of 12 kids and look at their individual trajectories:

id.sub = sample( unique( kenya$id), 12 )
ken.sub = subset( kenya, id %in% id.sub )
ggplot( data=ken.sub, aes( x=rn, y=ravens, group=id )  )+ 
            facet_wrap( ~ id ) + 
            geom_line( alpha=0.3 )

We have lots of noise! But there is also a trend.

Using the mosaic package, we can look at the progression of marginal means, which shows there is growth over time, on average:

mosaic::favstats( ravens ~ rn, data=kenya )
  rn min Q1 median Q3 max mean   sd   n missing
1  1   1 16     17 19  28 17.3 2.56 537       9
2  2   0 16     17 19  28 17.7 2.79 529      17
3  3   0 16     18 20  30 18.3 3.04 523      23
4  4   4 17     18 20  30 18.6 2.97 513      33
5  5   7 17     19 21  31 19.5 3.10 496      50


The above also shows that we have some missing data, more as the study progresses.
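
To see the attrition pattern directly, we can compute the proportion of missing Raven's scores at each round; a minimal dplyr sketch (assuming the tidyverse loaded above):

kenya %>%
  group_by( rn ) %>%
  summarise( n = n(),
             prop_missing = mean( is.na( ravens ) ) )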

We drop these missing observations:

kenya = subset( kenya, !is.na( ravens ) & !is.na( rn ) )

We have four treatment arms, which we reorder so that control is the reference level:

str( kenya$treatment )
 Factor w/ 4 levels "meat","milk",..: 1 1 1 1 1 4 4 4 4 4 ...
levels( kenya$treatment )
[1] "meat"    "milk"    "calorie" "control"
kenya$treatment = relevel( kenya$treatment, ref = "control" )
levels( kenya$treatment )
[1] "control" "meat"    "milk"    "calorie"

40.3 The mathematical model

Let’s fit a random slope model.

Level 1: We have for individual \(i\) in school \(j\) at time \(t\):
\[ Y_{ijt} = \beta_{0ij} + \beta_{1ij} (t - L) + \epsilon_{ijt} \]

Level 2: Each individual has their own growth curve. Their curve's intercept and slope vary around the school means:
\[ \beta_{0ij} = \gamma_{0j} + \gamma_{01} gender_{ij} + u_{0ij} \]
\[ \beta_{1ij} = \gamma_{1j} + \gamma_{11} gender_{ij} + u_{1ij} \]
We also have that \((u_{0ij}, u_{1ij})\) are normally distributed with some 2x2 covariance matrix. We are forcing the impact of gender to be constant across schools.

Level 3: Finally, our school mean intercepts and slopes are
\[ \gamma_{0j} = \mu_{00} + w_{0j} \]
\[ \gamma_{1j} = \mu_{10} + \mu_{11} meat_j + \mu_{12} milk_j + \mu_{13} calorie_j + w_{1j} \]
For the rate of growth at a school we allow different slopes for different treatments (compared to the control baseline); \(meat_j\), \(milk_j\), and \(calorie_j\) are indicators for the three treatment arms. We also have that \((w_{0j}, w_{1j})\) are normally distributed with some 2x2 covariance matrix:

\[ \begin{pmatrix} w_{0j} \\ w_{1j} \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \tau_{00} & \tau_{01} \\ \tau_{01} & \tau_{11} \end{pmatrix} \right) = N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \Sigma_{sch} \right) \]

The \(\mu_{00}\) and \(\mu_{10}\) are the intercept and slope for the overall population growth in the control group (this is what defines our marginal model).

We will use \(L = 1\) to center the data at the first time point (so our intercept is the expected Raven's score at the onset of the study).

Conceptual question: Why do we not have treatment in the intercept for school? What would changing \(L\) do to our model and this reasoning?

40.4 Fit the model

library( lme4 )
library( arm )  # for display()
kenya$rn = kenya$rn - 1 # center by L=1, so rn=0 is the first time point
M1 = lmer( ravens ~ 1 + rn + gender*rn + treatment:rn + (1+rn|schoolid) + (1+rn|id:schoolid), 
           data=kenya )
boundary (singular) fit: see help('isSingular')
display( M1 )
lmer(formula = ravens ~ 1 + rn + gender * rn + treatment:rn + 
    (1 + rn | schoolid) + (1 + rn | id:schoolid), data = kenya)
                    coef.est coef.se
(Intercept)         17.41     0.19  
rn                   0.59     0.08  
gendergirl          -0.30     0.20  
rn:gendergirl       -0.14     0.08  
rn:treatmentmeat     0.17     0.09  
rn:treatmentmilk    -0.13     0.09  
rn:treatmentcalorie -0.02     0.09  

Error terms:
 Groups      Name        Std.Dev. Corr  
 id:schoolid (Intercept) 1.40           
             rn          0.43     -0.09 
 schoolid    (Intercept) 0.45           
             rn          0.09     -1.00 
 Residual                2.31           
---
number of obs: 2598, groups: id:schoolid, 546; schoolid, 12
AIC = 12545.9, DIC = 12474
deviance = 12496.0 

Now let’s connect some pieces:

  • \(\mu_{00} = 17.41\) and \(\mu_{10} = 0.59\). The initial score for boys is 17.4, on average, with an average gain of 0.59 per measurement occasion in control schools.
  • \(\gamma_{01} = -0.30\) and \(\gamma_{11} = -0.14\), giving estimates that girls score lower and gain more slowly than boys.
  • The school-level variation in initial expected Raven's scores is 0.45 (this is the standard deviation of \(w_{0j}\)), relatively small compared to the individual variation of 1.40 (the standard deviation of \(u_{0ij}\)).
  • The correlation of the \(u_{0ij}\) and \(u_{1ij}\) is basically zero (estimated at -0.09).
  • The school random effects have an estimated covariance matrix \(\hat\Sigma_{sch}\), which we can assemble from the reported standard deviations (0.45 and 0.09) and correlation (-1.00): \[ \hat\Sigma_{sch} = \begin{pmatrix} 0.45^2 & -1.00 \times 0.45 \times 0.09 \\ -1.00 \times 0.45 \times 0.09 & 0.09^2 \end{pmatrix} \approx \begin{pmatrix} 0.20 & -0.04 \\ -0.04 & 0.01 \end{pmatrix} \] The correlation of -1.00 sits on the boundary of the parameter space (hence the singular fit warning above), suggesting we could drop the random slope for schools. (See the code sketch after this list for extracting these pieces directly.)
  • The treatment effects are estimated as \(\mu_{11}=0.17, \mu_{12}=-0.13\), and \(\mu_{13}=-0.02\).
  • P-values for these will not be small, however, as each estimate is small relative to its standard error of roughly 0.09.
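
Rather than reading these numbers off the printout, we can pull them out of the fitted model with lme4's accessors; a sketch (fixef() returns the fixed-effect estimates, and VarCorr() the estimated random-effect covariance matrices):

fixef( M1 )                                         # the mu and gamma estimates
VarCorr( M1 )[["schoolid"]]                         # estimated Sigma_sch
attr( VarCorr( M1 )[["schoolid"]], "correlation" )  # its correlation matrix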

We could try to look at the uncertainty in our parameters using the confint( M1 ) command, but it turns out that it crashes for this model. This can happen, and our -1.00 correlation gives a hint as to why. Let's first drop the school-level random slope and then try:

M1B = lmer( ravens ~ rn + gender*rn + treatment:rn + (1|schoolid) + (1+rn|id:schoolid), 
           data=kenya )
display( M1B )
lmer(formula = ravens ~ rn + gender * rn + treatment:rn + (1 | 
    schoolid) + (1 + rn | id:schoolid), data = kenya)
                    coef.est coef.se
(Intercept)         17.39     0.17  
rn                   0.57     0.08  
gendergirl          -0.30     0.20  
rn:gendergirl       -0.14     0.08  
rn:treatmentmeat     0.22     0.10  
rn:treatmentmilk    -0.09     0.10  
rn:treatmentcalorie  0.02     0.10  

Error terms:
 Groups      Name        Std.Dev. Corr  
 id:schoolid (Intercept) 1.42           
             rn          0.44     -0.11 
 schoolid    (Intercept) 0.33           
 Residual                2.31           
---
number of obs: 2598, groups: id:schoolid, 546; schoolid, 12
AIC = 12544.4, DIC = 12478
deviance = 12498.9 
confint( M1B )
                      2.5 %   97.5 %
.sig01               1.1657  1.65435
.sig02              -0.3544  0.32871
.sig03               0.3075  0.54179
.sig04               0.0000  0.60033
.sigma               2.2312  2.39580
(Intercept)         17.0605 17.71696
rn                   0.4091  0.72814
gendergirl          -0.6870  0.09145
rn:gendergirl       -0.2879  0.00772
rn:treatmentmeat     0.0164  0.40811
rn:treatmentmilk    -0.2876  0.09811
rn:treatmentcalorie -0.1775  0.20453

We then have to puzzle out which confidence interval goes with which parameter. The .sig01 row is the standard deviation of the kid-level random intercept (id:schoolid), which we can tell by the range it covers. The next (.sig02) must be the kid-level intercept-slope correlation, then .sig03 the kid-level slope standard deviation, and .sig04 the school-level intercept standard deviation; .sigma is the residual standard deviation. The interval for .sig04 reaches down to 0, so we have no confidence that the school random intercept variation is away from 0.
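
One way to sidestep this puzzle: confint() for lmer models accepts an oldNames = FALSE argument, which labels the variance-component rows descriptively (e.g., sd_(Intercept)|id:schoolid) rather than as .sig01, .sig02, and so on:

# same intervals, but with self-describing row names
confint( M1B, oldNames = FALSE )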

40.5 Some quick plots

We can look at the empirical Bayes estimates of the random intercepts:

schools = data.frame( resid = ranef( M1 )$schoolid$`(Intercept)` )
kids = data.frame( resid = ranef( M1 )$`id:schoolid`$`(Intercept)` )
resid = data.frame( resid = resid( M1 ) )
resids = bind_rows( school=schools, child=kids, residual=resid, .id="type" )
resids$type = factor( resids$type, levels = c("school","child", "residual" ) )

ggplot( resids, aes( x = type, y = resid ) ) +
  geom_boxplot() +
  coord_flip()

This shows that the occasion-level (residual) variation is much larger than the kid-level variation, which is in turn much larger than the school-level variation.
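
We can quantify this with a simple variance decomposition; a rough sketch that pulls the intercept and residual variance components out of VarCorr() and computes the share of variation at each level (this ignores the slope components):

vc = as.data.frame( VarCorr( M1 ) )
v_school = vc$vcov[ vc$grp == "schoolid"    & vc$var1 == "(Intercept)" & is.na( vc$var2 ) ]
v_kid    = vc$vcov[ vc$grp == "id:schoolid" & vc$var1 == "(Intercept)" & is.na( vc$var2 ) ]
v_resid  = vc$vcov[ vc$grp == "Residual" ]

# proportion of variance at each level
round( c( school = v_school, child = v_kid, occasion = v_resid ) /
         ( v_school + v_kid + v_resid ), 2 )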

We can calculate all the individual growth curves and plot those:

kenya$predictions = predict( M1 )
ggplot( data=kenya, aes( x=rn, y=predictions, group=id ) ) +
    facet_wrap( ~ schoolid, labeller = label_both) +
    geom_line( alpha=0.3 )

Generally, the individual curves are estimated to have positive slopes. The schools look quite similar visually; any school-level variation is small compared to the individual variation.
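
Finally, if we want the population-average growth curves implied by the fixed effects alone, predict() for lmer models accepts re.form = NA, which sets all random effects to zero; a sketch plotting one marginal curve per gender-by-treatment combination:

kenya$marginal = predict( M1, re.form = NA )
ggplot( data=kenya, aes( x=rn, y=marginal, color=treatment ) ) +
    facet_wrap( ~ gender ) +
    geom_line( aes( group = interaction( gender, treatment ) ) )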