“Because mixed models are more complex and more flexible than the general linear model, the potential for confusion and errors is higher.”
- Hamer & Simpson (2005)
Linear Mixed Models, as implemented in SAS’s Proc Mixed, SPSS Mixed, R’s lmer, and Stata’s xtmixed, are an extension of the general linear model. They use more sophisticated techniques to estimate parameters (means, variances, regression coefficients, and standard errors) and, as the quotation says, are much more flexible.
Here’s one example of the flexibility of mixed models, and its resulting potential for confusion and error.
In repeated measures and longitudinal studies, the observations are clustered within a subject. That means the observations, and their residuals, are not independent. They’re correlated. There are two ways to deal with this correlation.
The Marginal Model
One is to alter the covariance structure of the residuals. What this means is that instead of assuming that all observations are independent, as you do in a linear model, you assume the residuals from a single subject are related. Their covariances are non-zero. So you have to estimate the covariances among all the residuals from a single subject.
This approach is called a Marginal or Population Averaged approach. It’s not truly a mixed model, although you can use Mixed procedures to run it. You get these models in SAS Proc Mixed and SPSS Mixed by using a Repeated statement instead of a Random statement.
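To make "estimate the covariances among all the residuals from a single subject" concrete, here is a minimal Python sketch. It is not what Proc Mixed or SPSS Mixed do internally (they estimate by maximum likelihood or REML). It only computes an unstructured covariance matrix from residuals around each time point's overall mean. The data and all names are made up for illustration.

```python
# Hypothetical data: 5 subjects (rows) measured at 3 time points (columns).
scores = [
    [10.0, 12.0, 13.0],
    [14.0, 15.0, 18.0],
    [8.0, 9.0, 11.0],
    [12.0, 14.0, 14.0],
    [16.0, 17.0, 20.0],
]

n_subjects = len(scores)
n_times = len(scores[0])

# Marginal-model residuals: distance from the overall mean at each time point.
time_means = [sum(row[t] for row in scores) / n_subjects for t in range(n_times)]
residuals = [[row[t] - time_means[t] for t in range(n_times)] for row in scores]


def residual_covariance(resid):
    """Unstructured covariance matrix of the residuals across subjects."""
    n = len(resid)
    k = len(resid[0])
    return [
        [sum(r[a] * r[b] for r in resid) / (n - 1) for b in range(k)]
        for a in range(k)
    ]


cov = residual_covariance(residuals)
for row in cov:
    print([round(v, 2) for v in row])
```

The off-diagonal entries are the covariances between time points. Under independence they would be near zero; in clustered data like this they are not.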
The Mixed Model
The other way to deal with non-independence of a subject’s residuals is to leave the residuals alone, but actually alter the model by controlling for subject. When you control for subject as a factor in the model, you literally redefine what a residual is. Instead of being the distance between a data point and the average for everyone, it’s the distance between a data point and the mean for that subject.
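Here is a small sketch of what "redefining the residual" means, using made-up scores for three subjects. It uses each subject's raw mean, which is what you would get with Subject as a fixed factor. A random intercept model uses an estimate of the subject's mean that is pulled somewhat toward the overall mean, so treat this as an approximation of the idea, not of the software's output.

```python
# Hypothetical scores: 3 subjects, 4 repeated measurements each.
data = {
    "s1": [24.0, 26.0, 25.0, 27.0],
    "s2": [30.0, 31.0, 29.0, 32.0],
    "s3": [10.0, 12.0, 11.0, 9.0],
}

all_values = [y for ys in data.values() for y in ys]
grand_mean = sum(all_values) / len(all_values)

# Residuals when subject is ignored: distance from the mean for everyone.
grand_resid = {s: [y - grand_mean for y in ys] for s, ys in data.items()}

# Residuals when subject is controlled for: distance from that subject's mean.
subject_means = {s: sum(ys) / len(ys) for s, ys in data.items()}
subject_resid = {s: [y - subject_means[s] for y in ys] for s, ys in data.items()}

for s in data:
    print(s, [round(r, 2) for r in grand_resid[s]],
          [round(r, 2) for r in subject_resid[s]])
```

In the first set of residuals, every value for a given subject falls on the same side of zero, which is the non-independence. In the second set, each subject's residuals are centered on zero.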
You could, theoretically, include Subject as a fixed factor, but that usually uses up most of the degrees of freedom. If, instead, you treat Subject as a random factor, you are still controlling for Subject and can still redefine the residuals and deal with non-independence, while using up only a few degrees of freedom.
Because the model now contains both fixed and random effects, it is now officially a Mixed Model. You get these models in SAS Proc Mixed and SPSS Mixed by using a random statement.
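One way to see why a random intercept deals with non-independence so cheaply: it implies that any two observations from the same subject share the subject variance as their covariance (a compound symmetry pattern), and it needs only two variance parameters no matter how many subjects there are. The sketch below builds that implied covariance matrix. The variance values are hypothetical.

```python
def random_intercept_cov(n_times, var_subject, var_resid):
    """Covariance among one subject's observations implied by a random intercept."""
    return [
        [var_subject + (var_resid if a == b else 0.0) for b in range(n_times)]
        for a in range(n_times)
    ]


# Hypothetical variance components.
var_subject = 4.0  # variance of the subject-level random intercepts
var_resid = 1.0    # residual variance around each subject's own mean

V = random_intercept_cov(3, var_subject, var_resid)
for row in V:
    print(row)

# Intraclass correlation: share of the total variance that is due to subjects.
icc = var_subject / (var_subject + var_resid)
print(icc)
```

Every off-diagonal entry equals the subject variance, and the intraclass correlation is the implied correlation between any two observations from the same subject.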
Putting them together
Most of the time, controlling for Subject is enough to deal with all the non-independence of the residuals for each subject.
But every once in a while it’s not. If there is extra non-independence (or even non-constant variance) among the residuals, you can still estimate those non-zero covariances by adding a Repeated statement.
It’s fine to include a Repeated statement right along with a Random statement, and it is sometimes necessary for a well-fitting model. The Repeated statement still controls the covariance structure of the residuals for a single subject. It’s just that now those residuals have been redefined as the distance between each point and the subject’s mean. In the Marginal model, they’re not. They still represent the distance between each point and the overall mean.
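As a rough sketch of how the two pieces combine, the covariance among one subject's observations is the part contributed by the random intercept plus the residual covariance structure. The example below assumes an AR(1) residual structure purely for illustration (the software offers many others), and the parameter values are hypothetical.

```python
def ar1_cov(n_times, var_resid, rho):
    """AR(1) residual covariance: correlation decays with distance in time."""
    return [
        [var_resid * rho ** abs(a - b) for b in range(n_times)]
        for a in range(n_times)
    ]


def add_random_intercept(resid_cov, var_subject):
    """Add the subject variance to every entry of the residual covariance."""
    return [[v + var_subject for v in row] for row in resid_cov]


# Hypothetical values: residual variance 2, lag-1 correlation 0.5, subject variance 3.
R = ar1_cov(3, var_resid=2.0, rho=0.5)
V = add_random_intercept(R, var_subject=3.0)

for row in V:
    print(row)
```

The random intercept lifts all the covariances by the same amount, while the residual structure lets observations that are closer in time be more strongly related than those further apart.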
If you want to learn more about mixed models, check out the recording of my Random Intercept and Random Slope Models webinar. These two models are the basic building blocks of all mixed models.
Comments
Hi Karen,
Firstly, can I congratulate you on your website. Up until now, Julie Pallant’s SPSS Survival Manual had been my best friend. Your webinars/articles have explained many more topics not covered in that book in a similarly understandable way. I just wish I’d stumbled across your resources sooner.
In trying to get my head around linear mixed models I’ve watched 3 of your webinars (Running Repeated Measures as a Mixed Model, Rand-Int-Rand-Slope, & the Fixed vs Random factors in mixed models). Given how much help I’ve already received I hesitate to ask for more.
Also this question is a little tangential to the topic above, but I thought other readers might share an interest in the answer.
Whilst I understand how mixed models do a great job of controlling for Subject with repeated-measures, I’m still unclear on the role of Covariates. Recently I’ve read several reports of clinical trials where in the ‘Statistical analysis’ section of the paper, the authors describe inclusion of various covariates in the LMM (eg: age, gender, the baseline value of the DV, change in heart rate, etc).
I appreciate that you can never know exactly what authors have done, but I was just wondering how these common covariates might typically be included in a simple model assessing the efficacy of a treatment intervention (eg: 2 measurement time points pre & post; 2 groups) in SPSS Mixed. For time-invariant factors (eg: age and gender), the motivation is presumably to adjust for potential group differences in age/gender at baseline (at least that’s my interpretation). Would this involve including the interaction of group*age and group*gender as fixed factors? Should ‘age’ and ‘gender’ also be included as fixed factors alone or just the interaction?
As for time-varying covariates (eg: change in resting heart rate [HR] between baseline and follow-up), would this simply involve inclusion of 1 additional fixed factor (group*time*HR) in the model?
Finally, how would you go about including the baseline value of the DV as a covariate? Is this already done ‘automatically’ by including the intercept as a random factor?
Lastly, the obvious question is how to interpret whether the treatment is still effective after inclusion of the covariates. Is this still solely based on a significant group*time interaction? (ie: even if some of the covariates are significant fixed effects?)
My understanding is that a random slope cannot also be included due to only having 2 time points.
Many thanks again for all your work.
Hi Julian,
First, thanks for the kind words. I’m glad you’re finding my resources helpful.
There are a lot of questions here, and the answers are a bit involved. I’ll do my best to answer them concisely here.
Yes, mixed models control for subject by including it as a random factor, but they can also control for both time-varying and time-invariant covariates by including them as fixed. Baseline demographics and values of the DV are common covariates. Controlling for them in this way gives you means of the DV over time adjusted for where people started out.
Both types of covariates can interact with group or time. An interaction with group says that the effect of the covariate differs in the two groups. An interaction with time says that the average change in the DV over time differs across values of the covariate. Any interaction with time describes differences in the growth. I use growth generically; the DV doesn’t have to go up over time.
To test whether the treatment is still effective after inclusion of covariates (ie. intervention group has a higher mean than control), include covariates in the fixed statement, then add group.
If you want to test if the *growth* over time in the treatment group is higher than the control, include a group*time interaction.
Yes, you’re limited with only two time points: no random slope models. And if you include the baseline DV as a covariate, you can’t also use it as an outcome, so you’ll have only one time point as an outcome value. That means no random intercept either.
Karen
Thanks Karen – I hadn’t intended to ask so many questions when I began writing the comment
I appreciate you taking the time to reply.
Hi Karen
I have conducted a mixed model analysis on data from an RCT (intervention and control group) with a continuous DV assessed at 3 time points. I have generated the estimated marginal means, but want to report the effect size in terms of a confidence interval for the mean difference of the change in the DV between groups at Time 2, and separately at Time 3, relative to T1 (baseline). Is there any syntax to do this? If not, any suggestions as to what output from this analysis could be used?
Hi Helen,
Yes, there is a way.
First, I’m assuming you’re making time categorical to get the EMMEANS.
MIXED Response BY Time Trt
  /FIXED=Time Trt Time*Trt
  /METHOD=REML
  /EMMEANS=TABLES(Time*Trt) COMPARE(Time)
  /REPEATED=Time | SUBJECT(SubID) COVTYPE(CS).
So this EMMeans statement will give you a confidence interval for the difference between the times (eg. Time 2 – Time 1) separately for each group. It won’t give you the difference in the differences.
However, that is exactly what the parameter estimates for the regression coefficients tell you–the difference in the differences. It prints out a confidence interval by default.
Just add:
/PRINT=SOLUTION
You’ll have to make sure that Time 1 is the reference group in the dummy codes. This isn’t the default in SPSS Mixed, which makes the highest value the reference group. (You can change this in some regression procedures, but not in Mixed.) The easiest way to do it is to recode time 1 to some value higher than 3.
So you’ll have two interaction terms in the Solutions for Fixed Effects table (these are the regression coefficients). Those will give you the difference between groups in the differences from time 2 to 1 and time 3 to 1.
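To show what those two interaction terms represent, here is a small sketch with made-up cell means. It assumes a model with categorical time, group, and their interaction, with Time 1 and the control group as the reference categories. In that case each interaction coefficient equals a difference of differences among the estimated cell means. If the other group ends up as the reference, the sign flips. The numbers and names are hypothetical.

```python
# Hypothetical estimated cell means: means[group][time].
means = {
    "control": {1: 50.0, 2: 51.0, 3: 52.0},
    "treatment": {1: 50.5, 2: 55.0, 3: 58.0},
}


def diff_in_diff(cell_means, group, ref_group, time, ref_time):
    """Change from ref_time to time in one group, minus the same change in ref_group."""
    change_group = cell_means[group][time] - cell_means[group][ref_time]
    change_ref = cell_means[ref_group][time] - cell_means[ref_group][ref_time]
    return change_group - change_ref


for t in (2, 3):
    print(t, diff_in_diff(means, "treatment", "control", t, 1))
```

The first value is how much more the treatment group changed than the control group from Time 1 to Time 2, and the second is the same comparison for Time 1 to Time 3.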
Karen
Hi Karen,
Following up on your comment regarding the limitation of no random slope models with two time points (there is also a reference about that in your “Random Intercept and Random Slope Models” webinar), I am a bit confused.
If there are only two time points (or any 2-group categorical independent variable, e.g. gender), my impression is that there could still be a random slope model. In other words, the slope representing the difference between Time 1 and Time 2 (or between Group A and Group B of the categorical variable) could still differ and be random among the clusters. However, as far as I understand, if the variance in slope is accounted for in the model (in the random statement), a different slope would be fitted within each cluster. This would result in no residuals (i.e., a perfect fit, since there are only two Times or Groups), but it would still incorporate a random slope in the model.
Does this make sense or do I have it all wrong?
Thanks!
Michael
Hi Michael,
No, you have it right. The problem with the random slope model in that situation is that it’s confounded with the residual. All models will automatically fit a residual variance, so you overspecify if you try to fit a random slope as well.
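A quick sketch of that confounding, assuming one observation per subject at each of two time points coded 0 and 1. The data can tell you only three things about the covariance (two variances and one covariance), but a random intercept plus random slope plus residual asks for four parameters. Below, two different sets of hypothetical parameter values imply exactly the same covariance matrix, so no data set could tell them apart.

```python
def implied_cov(var_int, cov_int_slope, var_slope, var_resid, times=(0, 1)):
    """Covariance of one subject's observations under a random intercept and slope."""
    def cov(s, t):
        value = var_int + (s + t) * cov_int_slope + s * t * var_slope
        # The residual variance is added only on the diagonal (same observation).
        return value + (var_resid if s == t else 0.0)
    return [[cov(s, t) for t in times] for s in times]


# Two different hypothetical parameter sets.
a = implied_cov(var_int=4.0, cov_int_slope=0.0, var_slope=2.0, var_resid=1.0)
b = implied_cov(var_int=3.5, cov_int_slope=0.5, var_slope=1.0, var_resid=1.5)

print(a)
print(b)
print(a == b)
```

With more than one observation per subject at each time point, the residual variance can be estimated separately and the ambiguity goes away.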
That said, you CAN fit a random slope if there is more than one observation per subject at each time point (or at each value of the categorical variable). In most repeated measures designs, this isn’t true. There is only one measurement per subject per time point.
However, in a clustered data situation (ie. the clusters are across individuals, not repeated measurements across time), it’s more common.
It can occur in repeated measures, too. It’s just less common.
Karen
Thanks a lot, Karen.
Just to make it crystal clear: one can technically fit a random slope model with a 2-group categorical variable (or two time points with one observation at each time point), but that would be considered incorrect because of the confounding with the residual variance and the overspecification that you mentioned. Is that right?
Hi Karen,
I’m very interested in mixed models and I’ve been attending all your webinars on this subject. However, I’m still not sure if I can use mixed models in my research. My dependent variable is Ykijt, i.e. the investment made by investor k of country i in country j in year t. I want to check differences in Ykijt across different types of investors k. I’m also using covariates at the level Xit, Xjt and Xij and I also want to check differences in the influence of those covariates across different types of investors k. I also need to control for factors i, j and t, right? Do you think I can use a mixed model, with k as a fixed factor and i, j and t as random factors? Thanks a lot for all your help,
Best Regards,
Vanda
Hi Vanda,
Based on the way you described it, indeed it sounds like a mixed model. Assuming you have many investors from many countries i, you would make both investor and country of origin random.
It’s hard to say what should be random or fixed without knowing the exact data structure and research questions.
Karen