nlme for TSCS Data

We’re going to be analyzing data on voting rates (% of the adult population voting) in US elections from 1978 to 2014. Load the data files and combine them in preparation for the analysis.
library(tidyverse)
library(reshape2)
## load population data
pop <- read.csv('pop.csv') %>%
rename(year = V1, state = V2, population = pop) %>%
mutate(population = scale(population))
## load voting rate data
vote <- read.csv('vote.csv') %>% melt() %>%
mutate(variable = gsub('X', '', variable), variable = as.numeric(variable)) %>%
rename(year = variable, vote = value)
## load region data
regions <- read.csv('regions.csv') %>%
select(state = State, State.Code, region = Region)
## merge data
dat <- vote %>% left_join(regions) %>% select(-state) %>% rename(state = State.Code) %>% left_join(pop)
nlme uses a slightly different syntax for mixed effects models than lme4 does. We specify the fixed part of the model using a regular formula, just as we would for lm(), but the random component has to be passed to the random argument, e.g. random = ~ 1 | group produces a random intercept by group just like (1 | group) does in lme4.
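To make the correspondence concrete, here is the same random intercept model written both ways (a sketch with placeholder names y, x, group, and d):

## the same random intercept specification, side by side
## lme4: lmer(y ~ x + (1 | group), data = d)
## nlme: lme(y ~ x, random = ~ 1 | group, data = d)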
library(nlme)
## fit model
mod1 <- lme(vote ~ population + region, data = dat, random = ~ 1 | year)
## Error in na.fail.default(structure(list(vote = c(40.9, 49, 35.7, 36.5, : missing values in object
Uh oh, looks like we have some missing values in our data, and lme() doesn’t perform listwise deletion like lmer() does. Unfortunately, the error doesn’t tell us where the problem is in our data. Use apply() and anyNA() to figure out where we should start looking.
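(As an aside, we could make the error go away by changing lme()’s na.action argument, whose default is na.fail; the commented sketch below shows the idea. But silently dropping observations can hide real data problems, so let’s track them down instead.)

## listwise deletion, mimicking lmer()'s default behavior:
## mod1 <- lme(vote ~ population + region, data = dat,
##             random = ~ 1 | year, na.action = na.omit)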
apply(dat, 2, anyNA)
## year vote state region population
## FALSE FALSE TRUE TRUE TRUE
It looks like there are missing values for state, region, and population. This means that there aren’t any issues with our voting data, but something’s going wrong with our region and population data. Since we join the region data before joining the population data, let’s start there.
dat <- vote %>% left_join(regions)
## list missing states
dat$state[is.na(dat$State.Code)]
## [1] "Rhode Island " "Utah " "Rhode Island " "Utah "
## [5] "Rhode Island " "Utah " "Rhode Island " "Utah "
## [9] "Rhode Island " "Utah " "Rhode Island " "Utah "
## [13] "Rhode Island " "Utah " "Rhode Island " "Utah "
## [17] "Rhode Island " "Utah " "Rhode Island " "Utah "
Our voting data have an extra trailing space for Rhode Island and Utah. left_join() uses exact string matching, so this extra space is causing the problem. Look up the sub() function by typing ?sub. How can we use it to fix our data and ensure that everything gets properly merged?
## drop trailing spaces in RI and UT
vote <- vote %>% mutate(state = sub(' $', '', state))
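## base R's trimws() is a more general alternative that strips both
## leading and trailing whitespace in one go:
## vote <- vote %>% mutate(state = trimws(state))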
## merge data
dat <- vote %>% left_join(regions) %>% select(-state) %>% rename(state = State.Code) %>% left_join(pop)
## check for NAs again
apply(dat, 2, anyNA)
## year vote state region population
## FALSE FALSE FALSE FALSE FALSE
Now we’re good and can go ahead and estimate our model.
library(texreg)
mod1 <- lme(vote ~ population + region, data = dat, random = ~ 1 | year)
htmlreg(mod1, stars = .05)
|                 | Model 1  |
|-----------------|----------|
| (Intercept)     | 50.85*   |
|                 | (0.90)   |
| population      | -2.36*   |
|                 | (0.27)   |
| regionNortheast | -3.26*   |
|                 | (0.86)   |
| regionSouth     | -8.67*   |
|                 | (0.73)   |
| regionWest      | -3.00*   |
|                 | (0.78)   |
| AIC             | 3327.07  |
| BIC             | 3356.65  |
| Log Likelihood  | -1656.54 |
| Num. obs.       | 510      |
| Num. groups     | 10       |

*p < 0.05
While nlme isn’t as advanced as lme4 in a lot of ways, it has certain functionalities that offer much more control over your model. What if we’re worried that we not only need a random intercept by year, but that the error variance might also differ across years? We can account for this possibility with the varIdent() function and the weights argument to lme().
mod2 <- lme(vote ~ population + region, data = dat, random = ~ 1 | year,
            weights = varIdent(form = ~ 1 | year))
htmlreg(mod2, stars = .05)
|                 | Model 1  |
|-----------------|----------|
| (Intercept)     | 50.85*   |
|                 | (0.90)   |
| population      | -2.36*   |
|                 | (0.27)   |
| regionNortheast | -3.26*   |
|                 | (0.86)   |
| regionSouth     | -8.67*   |
|                 | (0.73)   |
| regionWest      | -3.00*   |
|                 | (0.78)   |
| AIC             | 3327.07  |
| BIC             | 3356.65  |
| Log Likelihood  | -1656.54 |
| Num. obs.       | 510      |
| Num. groups     | 10       |

*p < 0.05
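If the variance weights are doing anything, we should see different estimated relative standard deviations across years. We can inspect them through the fitted model’s variance structure (a quick check; output not shown):

## estimated residual standard deviations by year, relative to the reference year
summary(mod2$modelStruct$varStruct)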
The previous model assumes that errors are correlated within years, but not between them. Let’s take things a step further and fit an explicit time series model to our data. Take a look at what the correlation argument to lme() does, then use the corAR1() function to fit a model with a first-order autoregressive structure with states as the grouping variable.
mod3 <- lme(vote ~ population + region, data = dat, random = ~ 1 | state,
correlation = corAR1(form = ~ year | state))
htmlreg(mod3, stars = .05)
|                 | Model 1  |
|-----------------|----------|
| (Intercept)     | 50.86*   |
|                 | (1.31)   |
| population      | -3.17*   |
|                 | (0.55)   |
| regionNortheast | -3.18    |
|                 | (2.01)   |
| regionSouth     | -8.61*   |
|                 | (1.72)   |
| regionWest      | -3.09    |
|                 | (1.82)   |
| AIC             | 3219.59  |
| BIC             | 3253.39  |
| Log Likelihood  | -1601.80 |
| Num. obs.       | 510      |
| Num. groups     | 51       |

*p < 0.05
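How strong is the estimated serial correlation? intervals() reports the AR(1) coefficient Phi along with a 95% confidence interval (output not shown):

## confidence intervals for the variance-covariance parameters, including Phi
intervals(mod3, which = "var-cov")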
By default, lme() assumes that within-group errors are uncorrelated conditional on the random effects, but a random intercept induces compound symmetry in the marginal variance-covariance matrix: every off-diagonal entry is constrained to be equal. This is a restrictive assumption, but it’s exactly the structure implied by lmer(). Fit a model with random intercepts by state using both lmer() and lme(), making the compound symmetry explicit in the lme() call with corCompSymm(), and compare the results.
library(lme4)
mod_nlme <- lme(vote ~ population + region, data = dat, random = ~ 1 | state,
correlation = corCompSymm())
mod_lme4 <- lmer(vote ~ population + region + (1 | state), data = dat)
htmlreg(list(mod_nlme, mod_lme4), stars = .05,
custom.model.names = c('<code>nlme</code>', '<code>lme4</code>'))
|                         | nlme     | lme4     |
|-------------------------|----------|----------|
| (Intercept)             | 50.86*   | 50.86*   |
|                         | (1.31)   | (1.31)   |
| population              | -3.17*   | -3.17*   |
|                         | (0.55)   | (0.55)   |
| regionNortheast         | -3.18    | -3.18    |
|                         | (2.01)   | (2.01)   |
| regionSouth             | -8.61*   | -8.61*   |
|                         | (1.72)   | (1.72)   |
| regionWest              | -3.09    | -3.09    |
|                         | (1.82)   | (1.82)   |
| AIC                     | 3219.59  | 3217.59  |
| BIC                     | 3253.39  | 3247.23  |
| Log Likelihood          | -1601.80 | -1601.80 |
| Num. obs.               | 510      | 510      |
| Num. groups             | 51       |          |
| Num. groups: state      |          | 51       |
| Var: state (Intercept)  |          | 18.10    |
| Var: Residual           |          | 26.22    |

*p < 0.05
They’re identical! Well, at least the coefficient estimates, standard errors, variances of the random effects, and log-likelihood are identical. AIC and BIC differ because lmer() and lme() count the number of parameters in a model slightly differently, which we can verify by running AIC(mod_nlme, mod_lme4) and comparing the degrees of freedom.
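For completeness, that check is a one-liner (output not shown):

AIC(mod_nlme, mod_lme4)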
Since there are so many different ways to specify a mixed effects model (different random intercepts and slopes, different correlation structures, different levels of groups, etc.), it’s important to think about how we decide between specifications. The simplest way is to perform a likelihood ratio test with the anova() function. Fit a regular linear model with the same fixed effects as one of your multilevel models and compare the two; remember to list the unrestricted model first!
mod_lm <- lm(vote ~ population + region, data = dat)
anova(mod_lme4, mod_lm)
## Data: dat
## Models:
## mod_lm: vote ~ population + region
## mod_lme4: vote ~ population + region + (1 | state)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## mod_lm 6 3365 3390 -1676 3353
## mod_lme4 7 3331 3360 -1658 3317 35.9 1 2e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Our unrestricted model is more likely to have generated the data than our restricted one, so we should definitely be fitting some type of multilevel model to these data. However, this test is not entirely reliable here: the null hypothesis sets the random intercept variance to 0, a value on the boundary of the parameter space, so the likelihood ratio statistic doesn’t follow its usual chi-squared distribution. A parametric bootstrap of the test gives us a better approximation to the null distribution.
## parametric bootstrap of the likelihood ratio test
lme.boot <- function(model_r, model_ur){
  ## simulate a response vector under the restricted (null) DGP; assign it
  ## globally with <<- so update() can find it when refitting the models
  new_y <<- simulate(model_r, 1)[[1]]
  restricted_dev <- -2 * logLik(update(model_r, new_y ~ .))
  unrestricted_dev <- -2 * logLik(update(model_ur, new_y ~ .))
  return(restricted_dev - unrestricted_dev) # LRT statistic under the null
}
## 100 bootstraps
boot_dev_nre <- replicate(100, expr = lme.boot(mod_lm, mod_lme4))
## the bootstrapped approximation (this is preferred)
test_stat <- (-2 * logLik(mod_lm)) - (-2 * logLik(mod_lme4))
mean(boot_dev_nre > test_stat) # approximate p-value
## [1] 0
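With only 100 bootstrap draws, a p-value of exactly 0 should really be read as p < 0.01; a common add-one adjustment avoids reporting zero outright:

## add-one adjustment for bootstrap p-values
(1 + sum(boot_dev_nre > test_stat)) / (1 + length(boot_dev_nre))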
Not a single bootstrapped likelihood ratio statistic exceeds our observed test statistic, so once again the unrestricted model fits the data better than the restricted one. Let’s try comparing the first-order autoregressive correlation model with the heteroskedastic model.
## 100 bootstraps
boot_dev_nre <- replicate(100, expr = lme.boot(mod2, mod3))
## Error in simulate.lme(model_r, 1): models with "corStruct" and/or "varFunc" objects not allowed
Unfortunately, simulate.lme() doesn’t allow us to simulate outcomes from models with corStruct or varFunc components, so we have to try another option.
anova(mod3, mod2)
## Model df AIC BIC logLik Test L.Ratio p-value
## mod3 1 8 3325 3359 -1654
## mod2 2 7 3327 3357 -1657 1 vs 2 4.2 0.041
According to the likelihood ratio test, the less restrictive AR1 model fits the data significantly better than the heteroskedastic model (p = 0.041).