“Because mixed models are more complex and more flexible than the general linear model, the potential for confusion and errors is higher.”

– Hamer & Simpson (2005)

Linear Mixed Models, as implemented in SAS’s Proc Mixed, SPSS Mixed, R’s lmer (from the lme4 package), and Stata’s xtmixed (now called mixed), are an extension of the general linear model. They use more sophisticated techniques for estimation of parameters (means, variances, regression coefficients, and standard errors), and as the quotation says, are much more flexible.

Here’s one example of the flexibility of mixed models, and its resulting potential for confusion and error.

In repeated measures and longitudinal studies, the observations are clustered within a subject. That means the observations, and their residuals, are not independent. They’re correlated. There are two ways to deal with this correlation.

## The Marginal Model

One is to alter the covariance structure of the residuals. What this means is that instead of assuming that all observations are independent, as you do in a linear model, you assume the residuals from a single subject are related. Their covariances are non-zero. So you have to estimate the covariances among all the residuals from a single subject.

This approach is called a Marginal or Population Averaged approach. It’s not truly a mixed model, although you can use Mixed procedures to run it. You get these models in SAS Proc Mixed and SPSS Mixed by using a repeated statement instead of a random statement.
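In symbols, here is a minimal sketch of the marginal model for a subject $i$ measured at three time points. The notation is an assumption chosen for illustration, not taken from any software manual: the fixed effects describe the mean, and all of the within-subject correlation is carried by the residual covariance matrix $R_i$, shown here as unstructured.

```latex
y_i = X_i \beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, R_i), \qquad
R_i = \begin{pmatrix}
\sigma_1^2  & \sigma_{12} & \sigma_{13} \\
\sigma_{12} & \sigma_2^2  & \sigma_{23} \\
\sigma_{13} & \sigma_{23} & \sigma_3^2
\end{pmatrix}
```

The ordinary linear model is the special case $R_i = \sigma^2 I$, where every off-diagonal covariance is zero.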

## The Mixed Model

The other way to deal with non-independence of a subject’s residuals is to leave the residuals alone, but actually alter the model by controlling for subject. When you control for subject as a factor in the model, you literally redefine what a residual is. Instead of being the distance between a data point and the average for everyone, it’s the distance between a data point and *the mean for that subject*.
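To make the redefinition of a residual concrete, here is a small Python sketch using made-up numbers (the data and variable names are hypothetical). It uses raw subject means, which is what a fixed Subject factor would give; a random Subject effect shrinks those means somewhat toward the overall mean, so treat this as an illustration of the idea rather than what a mixed procedure computes.

```python
from statistics import mean

# Hypothetical long-format data: (subject, response), three measurements each.
data = [
    ("s1", 10.0), ("s1", 12.0), ("s1", 11.0),
    ("s2", 20.0), ("s2", 22.0), ("s2", 21.0),
]

# Overall mean, ignoring which subject each observation came from.
grand_mean = mean(y for _, y in data)

# Mean response for each subject.
subject_means = {
    s: mean(y for subj, y in data if subj == s)
    for s in {subj for subj, _ in data}
}

# Residuals when Subject is ignored: distance from the overall mean.
resid_overall = [y - grand_mean for _, y in data]

# Residuals after controlling for Subject: distance from that subject's mean.
resid_within = [y - subject_means[s] for s, y in data]

print(resid_overall)
print(resid_within)
```

In the first set, every residual from subject s1 is negative and every residual from s2 is positive, which is exactly the within-subject dependence. In the second set, that shared subject-level shift is gone.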

You could, theoretically, include Subject as a fixed factor, but that usually uses up most of the degrees of freedom. If, instead, you treat Subject as a random factor, you are still controlling for Subject and are still able to redefine the residuals and deal with non-independence, while using up only a few degrees of freedom.

Because the model now contains both fixed and random effects, it is now officially a Mixed Model. You get these models in SAS Proc Mixed and SPSS Mixed by using a random statement.
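One consequence of a random Subject intercept is that it implies a particular covariance pattern among a subject’s observations: the same covariance for every pair, known as compound symmetry. The Python sketch below builds that implied matrix; the function name and the variance values are hypothetical, chosen only for illustration.

```python
def random_intercept_cov(n_times, var_subject, var_resid):
    """Implied covariance among one subject's observations under a
    random-intercept model: var_subject in every cell, plus var_resid
    added on the diagonal."""
    return [
        [var_subject + (var_resid if i == j else 0.0) for j in range(n_times)]
        for i in range(n_times)
    ]

# Hypothetical variance components: between-subject 4.0, residual 1.0.
V = random_intercept_cov(3, var_subject=4.0, var_resid=1.0)

# Implied correlation between any two observations from the same subject.
within_subject_corr = 4.0 / (4.0 + 1.0)

for row in V:
    print(row)
print(within_subject_corr)
```

Every off-diagonal entry equals the between-subject variance, so the model assumes any two measurements on a subject are equally correlated, however far apart in time they are.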

## Putting them together

Most of the time, controlling for Subject is enough to deal with all the non-independence of the residuals for each subject.

But every once in a while it’s not. If there is extra non-independence (or even non-constant variance) among the residuals, you can still estimate those non-zero covariances by adding a Repeated statement.

It’s fine to include a Repeated statement right along with a Random statement, and it is sometimes necessary for a well-fitting model. The repeated statement still controls the covariance structure of the residuals for a single subject. It’s just that now those residuals have been redefined as the distance between each point and the subject’s mean. In the Marginal model, they’re not. They still represent the distance between each point and the overall mean.
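In the usual mixed-model notation (the symbols here are the conventional ones, assumed for illustration), the two statements shape two different matrices: the Random statement determines $G$, the covariance of the random effects $u_i$, and the Repeated statement determines $R_i$, the covariance of the residuals $\varepsilon_i$.

```latex
y_i = X_i \beta + Z_i u_i + \varepsilon_i, \qquad
u_i \sim N(0, G), \qquad \varepsilon_i \sim N(0, R_i), \qquad
\operatorname{Var}(y_i) = Z_i G Z_i^{\top} + R_i
```

With a Random statement alone, $R_i = \sigma^2 I$. With a Repeated statement alone, the $Z_i G Z_i^{\top}$ term drops out and $R_i$ carries all of the within-subject covariance. With both, the two terms add.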

## Comments

Hi Karen,

I watched your webinar on random slopes and I am wondering… if your dependent variable is not linear with time, is it wise to include TIME (categorical) as a RANDOM factor? You will get a ton of covariances and variances and most of the time convergence will not be reached because the model is too complex.

Best,

Sara

Hi Sara,

You could make time categorical instead of continuous, but that can cause other problems. If you include a random slope for a categorical predictor, and there is only one response per subject per category, then that random slope becomes confounded with the residual.

But you can include a quadratic term for time, if the non-linearity fits a curve of that type. If you had more time points, you could try other non-linear terms.

Hi Karen

In the past I used to include site and/or treatment-by-site interaction in the model to reduce the variability. At that time I treated site as a fixed effect. Now I have a study with many small sites. I cannot include them as fixed effects because that takes away too many degrees of freedom. How would I write PROC MIXED to include site as a random effect? Simply RANDOM SITEID SITEID*TRT? Any options at the end of the statement? What about using PROC GLM, could it work for this example?

Thanks,

Hi Alfred,

Yes, you want to use mixed, not glm. GLM can technically do it, but it estimates everything differently than MIXED and is considered less accurate.

As for how to write the random statement, I can’t tell you the correct way to do it without knowing your exact design, but yes, I would start with what you’ve got. If you want guidance on the exact way to do it, that’s exactly what we’ve set up our Statistically Speaking program for. It’s an inexpensive way for you to get specific help where we can ask you all the design questions.

Hi Karen,

Thanks for all these nice workshops, they are indeed extremely helpful. Sadly, my question is on repeated measurements, and I actually do not have the time to wait for your nice workshop, which I will certainly attend.

I want to explore the effect of the average prenatal maternal stress (cortisol level, continuous measure) on offspring growth during a linear growth period (monthly body size measure, N = 17 infants, 16.7 ± 1.3 body size measures per infant). So my basic model is

lm(BodySize ~ PrenatalStress + Age + PrenatalStress*Age)

but this one does not control for repeated measurements/dependent data. In theory, I could add infant-ID as a random slope factor, but this would take away the variance between the infants, which is what I’m interested in (indeed, “PrenatalStress” has only one value per infant, and R informs me that the mixed model (lme4) is nearly unidentifiable, etc.).

Therefore, I’m now looking for a method that controls for repeated measurements without taking away the between-subject variance. I was thinking about running a GEE with geepack or a GLS (package nlme) with infant-ID as “repeated statement” similar to SPSS, but I’m not sure if this is what I need. All 3 methods give very different results, although the direction of the estimated coefficients is identical. So the formulas are actually:

GEE (geepack): geeglm(formula = BodySize ~ PrenatalStress + Age + PrenatalStress*Age, data = xdata, id = ID, corstr = "ar1")

GLS (nlme): gls(BodySize ~ PrenatalStress + Age + PrenatalStress*Age, data=xdata, corr=corAR1(,form=~Age|ID))

(Translated into your example from the webinar: I have no “Rural” variable, “County” is infant-ID, “PrevCollege” is PrenatalStress, and “JobsK” is BodySize (and “Time” is Age), and I’m interested in how slope differences between Counties are predicted by PrevCollege.)

Best wishes, and thank you so very much!

Andreas

Hi Karen,

I have run a mixed model in SAS using Proc Mixed, with subject as a random effect. Now I need to report the results but am having trouble doing so. I have always reported ANOVA results in APA style, where you include the degrees of freedom of the error term. Given that the SAS output doesn’t include any information on the error term, how should I go about reporting my results? If you have any suggestions, please let me know! I’d greatly appreciate it.

Thanks!

Ariel

Hi Karen,

Thank you for clarifying these concepts on mixed models. I have a problem similar to some others reported previously. I have two groups and 3 within-subject factors, which are nested, i.e. I have two stimuli (factor1), for each stimulus 3 configurations (factor2), and for each configuration 3 positions to consider (factor3). Assuming I am interested in main effects for those factors as well as the between-groups effect, would this SPSS syntax correctly model my data?

/FIXED= GROUPS + FACTOR1 + FACTOR2 + FACTOR3 | SSTYPE(3)

/RANDOM=INTERCEPT | SUBJECT(SUBID)

/REPEATED=FACTOR1*FACTOR2*FACTOR3 | SUBJECT(SUBID) COVTYPE(UN)

Best and thanks in advance,

Alessandro

Hi again Karen,

I forgot another question: what would be the difference if I were to skip the asterisks in the repeated statement?

I mean, what is the difference in how I am modelling my dependent variable if I use

/REPEATED = FACTOR1*FACTOR2*FACTOR3 | SUBJECT(SUBID) COVTYPE(UN)

compared to

/REPEATED = FACTOR1 FACTOR2 FACTOR3 | SUBJECT(SUBID) COVTYPE(UN)

Best

Alessandro

Hi Karen,

I looked at several of your webinars. Very helpful. Thanks a lot!

I frequently use MIXED to analyze data with one or multiple sources of non-independence. However, I run into problems with designs that contain only dichotomous within-subject variables and only one data point per cell of the design per subject. In these designs, the residuals are zero (the level-1 models perfectly fit the data). I understand that technically, such linear mixed-effects models are not identifiable. They would be identifiable, however, if I could fix the parameter for the variance of the residuals to zero. Is there a way to do that in SPSS and SAS?

I could, of course, transform my data into wide format and analyze with a GLM procedure (e.g., proc reg, GLM) but it seems bizarre to have to go through the tedious data restructuring process and use different commands for a certain type of design that is in fact quite similar to other designs that can easily be analyzed with MIXED.

I tried a number of things (e.g., not including any random slopes, not including the random slope for the highest order interaction effect), but none of them gave me the “right” values for the inferential statistics. Take a 2 x 2 within-subjects ANOVA with one data point per cell of the design from each participant. By transforming the data into wide format and using a standard GLM procedure I can obtain the “right” F- and p-values.

I have not found a way to obtain the same values with the data in long format (i.e., four lines per participant) and using MIXED. It doesn’t matter which random effects structure I specify … I am not getting the “right” F- and p-values.

Do you know how to fix the parameter for the variance of the residuals to zero? In SPSS? In SAS?

Thanks a lot for your insight,

— Markus

Hi Markus,

You have great timing. We just covered this topic yesterday in my Analyzing Repeated Measures Data workshop.

You can’t specify 0 variance for the residuals in any model. Residual variance must exist in any model. That said, you should still be able to do it.

You’re right that you can’t specify a random slope where the repeat is categorical (as in an anova) and there is only one value per time point. You should be able to get a random intercept model to fit, though. This is equivalent to a repeated measures model with compound symmetry of the R matrix. Since you can’t specify compound symmetry in GLM, you may not get the exact same results.

However, if you run MIXED with a repeated statement (it works the same in SAS and SPSS) instead of a random, you should be able to replicate your GLM results. Note that they may differ if there is any missing data. If there is, then the GLM results are biased, but MIXED results are not.

You do not want to use the default covariance structure though. If you change it to UN, then you’ll get the same results as the multivariate output in GLM. To replicate the univariate results, you’d need to use a Huynh-Feldt covariance structure and I’m not even sure that’s an option.

Don’t assume GLM is “right.” It is making some assumptions that MIXED doesn’t that don’t often hold.

So if I had a 2×2 ANOVA where both factors A & B were within subjects, this is how I would specify the repeated statement:

SPSS: repeated A*B|subject(subid) covtype(UN)

SAS: repeated A*B/subj=subid type=un

Unstructured may be more detailed than you need, however, so you may want to refine it from there.

Hello Karen,

thanks for this great article. It turns out, there is not much material that explains it as well as you do.

If you define your SUBJECT variable in the first dialog box of SPSS’ MIXED Procedure you do tell the Software within which variable the data are correlated and furthermore the residuals are not independent. The subject could be a child, measured multiple times. Why is it still necessary to define the REPEATED statement? What exactly is the difference between defining just the SUBJECT(child) – that of course I must do – and defining SUBJECT(child) and REPEATED(time)? I tried both ways and I got slightly different results.

It would be great if you could clarify that a little further, although you mentioned it already in the article.

Thank you and I hope you will keep providing us with your excellent material.

Marc

Hi Marc,

I never use the menus for SPSS Mixed because of this: they’re so unintuitive, and it’s incredibly difficult to tell what they’re doing. So #1, use syntax.

Once you use syntax, you need to include a random statement and/or a repeated statement. I suspect that in the situation you describe, you’re specifying both in one case and just one in the other. Both the random and repeated statements require you to define a subject.

So my suggestion is to try it the two ways you did in the menus and paste the syntax both times. See if they’re different.

Hello Karen,

In a usual proc mixed for repeated measures analysis, with intercept and time as random effects, and group and group*time as fixed effects, I have been asked to add gender to the model.

To the right-hand side of the model statement I will add gender and gender*time.

The question is: add gender after modifying the intercepts.

How can I modify intercepts for gender? What should I do to the model?

Hi Hammam,

I’m not sure what you mean by modifying the intercepts. It sounds correct to add gender and gender*time to the model statement. I would also include time in there. It’s doing something different than time in the random statement and okay to have both.

Hi Karen,

Great website. I am in a similar situation to Lara. I have tested some brain-lesioned patients. Two of the patients have bilateral lesions so they have been tested twice (once for each hemisphere), and the other seven patients have only been tested once. So I have 11 ‘tested hemispheres’ from 9 patients. 9 control participants were tested in the same way (i.e., two of the controls were tested twice in the same way as the patients). The testing sessions themselves consisted of a 2x2x2 (say, A x B x C) repeated measures design with 24 repetitions of each condition.

Since the two sessions for the bilateral patients and their matched controls aren’t independent I have entered Participant as a random factor as well as Tested Hemisphere. A, B and C are entered as fixed factors that are repeated measures. Group (patient or control) is a fixed factor that is not repeated measures.

I get the same error as Lara saying “the levels of the repeated measures effect are not different for each case within a repeated subject”. So I created another variable ‘trial’ and entered that as a repeated measure and random factor. This is too much for my laptop to handle, although I might have better luck with my desktop at work. I’m just curious if this is the right approach as, like Lara, I am quite new to mixed models.

Thanks in advance!

Hi Karen,

Thanks for your great resource. I am trying to use the Generalized Mixed Model in SPSS (GENLINMIXED) to fit a binary logistic model to some accuracy data, and I have a couple of questions… I am only a beginner in all this!

My DV is accuracy on each trial (1=correct response, 0=incorrect response), and I have two repeated factors: congruency(0=incongruent, 1=congruent) and trial_type(1=blue, 2=green, 3=red). I have 29 participants who do 20 trials for each combination of levels for the repeated measures (therefore, I have 120 accuracy values for each participant; i.e. 20 values for incongruent-blue trials, 20 values for congruent-red trials, etc.). I want to add reaction time (RT, a continuous variable) into the model too as I want to see if the relationship between accuracy and RT changes between different levels of the repeated factors.

So, I have set up my analysis like so:

factor: accuracy score for each trial

random effects: Participant ID

fixed factors: congruency, trial_type and RT (main effects, all two-way and all three-way interactions)

->fitting a binomial distribution with Logit link function

However, this set-up gives an error saying “the levels of the repeated measures effect are not different for each case within a repeated subject”. I think this is because I have 20 cases per level of the repeated measure per participant.

I can only think of two ways to fix this:

1) average my accuracy scores across each repeated measure for each participant; but this means my data isn’t binomial any more and I lose the original structure

2) enter another random variable, which is ‘trial repetition’ (1-20); but when I run this, it takes two days and then says “iteration was terminated but convergence has not been acheived” and all my effects are zero.

I am very new to all this and have been reading and watching your tutorials but am still rather unclear about it all. Do I have the general set-up even vaguely right for what I am trying to do, or is there some fundamental error?

Many thanks!

Hi Karen,

thank you a lot for the useful website and the understandable explanations of some really complex topics!

I have two relatively simple questions, I hope I manage to ask them in a simple fashion 🙂

I have conducted a mixed model analysis on data from an RCT (verum and placebo treatment, 10 patients in each group) with a continuous DV assessed at 5 time points. The first two time points are before starting the therapy (the idea was to show that the patients were “stable”), the remaining three were after 1 week of therapy, after 2 weeks of therapy (then the therapy was suspended) and 2 weeks after the last therapy (to see if the expected amelioration of symptoms in the Verum group persisted).

My first question is whether I should use t0 as baseline, covariate and fixed and t1-t4 as repeated or, since t0 and t1 are almost identical, ignore t0 and use t1 as baseline and t2-t4 as repeated?

The second question is regarding the random effects: I tried to specify the intercept as random, but the result is either a convergence problems or a Hessian Matrix warning (depending on the covariance structure for the repeated measurements). Do you think it is necessary for my model to use a random intercept? In a previous answer to a related question, you wrote that this problem might be due to the coding of t0…?

Thank you in advance for your time and help!

Filipa

Hi Karen,

I took your GLMM course a few years back. I have been reviewing some of the video material from the course over the last couple of days and I’m a bit puzzled by how you specify the repeated command in one of the examples we worked through (on the swallowing dataset).

When you went through the marginal model, the repeated statement was specified as:

/REPEATED Task|Subjects(ParticipantID*Trial) COVTYPE (AR1)

but later on, in the mixed model module, it was specified as:

/REPEATED Task*Trial|Subjects(ParticipantID) COVTYPE (CS).

My question is not really about the selection of covariance structures but about how to specify the repeats, as Task*Trial or as ParticipantID*Trial.

It might actually be more helpful if I talked about the data I’m working with and why I’m confused here.

I’m working with time series data (i.e. repeated measurements) in a group of individuals. For each subject there is a dependent variable (y) and various independent variables (let’s call them a, b, and c). The repeated measurements are taken at fixed time intervals and the ordering matters. So I’ve created another variable called Time coded as 1,2,3…

What I’m interested in is the differences in the slopes relating a, b, and c with y under two different study conditions. So I’ve specified a, b and c, as covariates, and the study condition as a fixed factor. I want the slopes and intercepts to be able to vary for each individual so I’ve also included intercept, a, b and c and subjects as random effects.

I also want to account for the fact that all the measurements are repeated measures within a subject, and here is where I’m getting confused.

Do I specify as below:

option 1: /REPEATED=Time | SUBJECT(Subject*Condition) COVTYPE(AR1).

or

option 2: /REPEATED=Time*Condition | SUBJECT(Subject) COVTYPE(AR1).

If I specify according to option 1, then according to the model dimension table the model thinks there are twice the number of subjects than there really are, but the model runs.

If I specify according to option 2, the model won’t converge but the number of subjects is correct?

I’m getting a bit confused because in the seminar material you also did both ways?

Any comment will be very helpful.

Hi Shieak,

This is hard to answer specifically so I’ll give you some overall explanations. You’re welcome to set up a consultation if you’d like to dig into specifics. Or join our Brown Bag program and come to the next Q&A session.

First, what that repeated statement is doing depends on the design, what’s in the fixed effects, and whether there is also a random statement. That changes what the residuals ARE.

Second, remember that the repeated statement is controlling the Sigma (aka R) matrix. The subject specification there is for what responses are correlated.

Hope that helps at all.

Karen

Hi Karen,

Sorry to have missed something.

I am using SPSS to run the mixed model by using the Linear Mixed Model option. There is a dialogue for repeated measurement at the beginning. Do I need to specify “ID” as subject and “Day” as repeated there? There is also a block for random factors afterward. Do I need to include “Day” there as well?

I looked at the syntax and found

/Repeated

/Random

I am not sure what is the difference between these two in SPSS.

Thank you in advance for your help!

Best regards.

Oriole

Hi Oriole,

I know this isn’t what you’re asking, but that first dialog box is what makes the menus for mixed so confusing. I never use them.

The difference between the repeated and random statements is really the key to understanding this stuff, and it’s very complicated if you’re not already familiar with mixed models.

The short answer is the random statement controls the G matrix (random effects) and the repeated statement controls the R matrix (residuals). If that makes sense to you, good. If it doesn’t, I would recommend taking my repeated measures workshop. It’s 16 hours long b/c that’s how long it takes to really explain it in a way that non-statisticians get it. I show many examples of how these two approaches affect the analyses in different designs.

Hi Karen,

I am trying to analyse the data of my physiological experiment on seaweed. I have 4 temperatures, and 3 aquariums were used as replicates for each temperature. There are 6-8 plants in each aquarium and I took repeated measurements of the growth rate at days 7, 13, and 22.

I am not sure which are the random factors in the mixed model. Some people said time should be used as a random factor, or should I use it as the repeated measurement and define each plant or aquarium as subject? I am really confused by the terms random factor, repeated measurement and covariance.

Thank you for your generous help!

Oriole

Hi Oriole,

You have a design with a few complications. Either plant or aquarium could be random, depending on a few very important details.

I would honestly suggest a consultation, but start with these two webinars, which will hopefully clear up a few things (they’re free):

https://www.theanalysisfactor.com/fixed-and-random-factors-in-mixed-models-what-is-the-difference/

https://www.theanalysisfactor.com/random-intercept-and-random-slope-models-webinar/

Hi Karen,

This thread has been really helpful, but I’m still not 100% sure how to implement the mixed models in my research. I am using SPSS and would love to know how to put in time-varying covariates into a repeated measures ANOVA.

For my repeated measures part I have a before and after and then I have a between subjects variable as well. That being said my co-variate varies between the before and after but there is only one place to put covariates in the repeated measures ANOVA.

I guess my main question is this: How can I attach the correct BEFORE covariate to the BEFORE dependent variable and the AFTER covariate to the AFTER dependent variable only?

Any help on this would be greatly appreciated! Let me know if my question wasn’t clear 🙂

Thanks again,

Gerri

Hi Gerri,

You can’t do it in repeated measures anova. That’s one of its disadvantages.

You have to do it in mixed, which requires setting it up in the long format. See these:

https://www.theanalysisfactor.com/advantages-of-repeated-measures-anova-as-a-mixed-model/

https://www.theanalysisfactor.com/wide-and-long-data/

Hi Karen,

I am attempting to analyse an RCT, which has measured fatigue as an outcome at 3 time points (baseline, 3 and 6 months post intervention). I have been reading about MLM and growth models and am confused about which is the most appropriate. I had run the following analysis;

MIXED Fatigue BY Group Time

/CRITERIA=CIN(95) MXITER(100) MXSTEP(10) SCORING(1) SINGULAR(0.000000000001) HCONVERGE(0,

ABSOLUTE) LCONVERGE(0, ABSOLUTE) PCONVERGE(0.000001, ABSOLUTE)

/FIXED=Group Time Group*Time | SSTYPE(3)

/METHOD=REML

/PRINT=DESCRIPTIVES SOLUTION TESTCOV

/RANDOM=INTERCEPT | SUBJECT(Participant_code) COVTYPE(UN)

/REPEATED=Time | SUBJECT(Participant_code) COVTYPE(AD1)

/SAVE=PRED RESID

/EMMEANS=TABLES(Group) COMPARE ADJ(SIDAK)

/EMMEANS=TABLES(Time) COMPARE ADJ(SIDAK)

/EMMEANS=TABLES(Group*Time) COMPARE(Group) ADJ(SIDAK)

/EMMEANS=TABLES(Group*Time) COMPARE(Time) ADJ(SIDAK) .

But now see that growth modelling may be a better option. I am however a bit confused about where group would be placed in the analysis as I build up from a first-order polynomial to second-order polynomial?

Any advice would be much appreciated.

Best wishes

Hayley

Hi Hayley,

I’m not entirely comfortable giving advice on how to set up multilevel models without asking you many questions. This is why I offer Quick Question Consultations: to make that easy.

That said, individual growth models are just a type of MLM–it’s not one or the other. If what you mean by a growth model is including a random slope for time (so the effect of time is different for each person), then you’ve got Group in the right place. That wouldn’t change.

Hi Karen,

This is the first time that I am going to run a Mixed Model analysis, and your website is very helpful for me. Thank you for that already!

However, I am still uncertain about some things.

I have done a study in which participants received a lot of images. For all the images they had to answer the same question. This is the DV. The reason I wanted to use Mixed Models instead of GLM (which I have also done) is that I wanted to take the different images into account. Is it then better to treat image as a random factor (just as subject) or as repeated? Or both?

Thanks in advance!

Best,

Wieteke

Hi Wieteke,

It’s hard to tell without talking with you, but it sounds like a classic Crossed Random Effects Design.

See this: https://www.theanalysisfactor.com/multilevel-models-with-crossed-random-effects/

In other words, make both image and subject random effects.

Karen

Hi Karen,

I have some problems interpreting the coefficient for the interaction term. I understand that the interaction between time and group means the average change in the DV over time is different in the two groups. However, if the covariate is a time-varying continuous variable, then how do I interpret the coefficient for “time*covariate”? If the covariate is fixed, I guess I can say that the average growth rate will differ by the coefficient when the covariate differs by 1 unit. However, I am not sure whether I can still say that if the covariate is time-varying. Thanks a lot for your help!

Cheers,

Jane

Hi Karen,

I’ve specified a growth model in which the primary outcome (DV) has 3 time points (t0, t1, t2), with a random intercept, and I included t0 (baseline DV) as a fixed effect, or covariate if you will, for growth in the DV. However, is it possible to include both a random intercept and the baseline DV as a covariate in one model? Furthermore, is including baseline DV as a fixed effect ok if it is also part of the primary outcome (DV)? And is it true that a random intercept corrects for regression towards the mean? I hope you can help me. Thank you in advance.

Regards,

Mary

Well, I suppose it’s possible, but it may actually depend on whether you’ve coded t0 as 0 or some other value. Does it run without errors? I wouldn’t be surprised if you had convergence problems or an ugly Hessian Matrix warning.

At the very least, you may get a 0 variance estimate for the random intercept. I think you’d be better off removing the baseline covariate, as the random intercept is already taking care of the fact that some patients are starting out higher than others.

Hi Karen,

I am wondering if I can use Mixed models for the following design. I have a cross-linguistic study with the native language of subjects (German vs. Japanese) and the language of training stimuli (German vs. Japanese) as independent variables. After training, the subjects performed 2 tasks (tasks A & B). In each task, there were 3 sets of stimuli (sets a, b, & c). Every subject was tested with all 3 sets of stimuli in both tasks.

I have run correlational analyses on the subjects’ overall # of correct answers in tasks A & B and their performance for each set of stimuli. Within each set of stimuli, there’s a positive correlation between the subjects’ performance in tasks A & B. However, after plotting scatterplots, I found that the patterns for different groups of subjects (e.g., German speaker trained with German stimuli or German speaker trained with Japanese stimuli) were very different. For example, one group had a bimodal distribution (i.e., the data points either clustered around the higher ends for both tasks or the lower ends for both tasks) while another group had more evenly distributed points that spread along the regression line.

I then tried to examine whether the native language of the subjects and the language of training stimuli affected those distributions. Originally, I ran a loglinear analysis and used the speaker’s native language (categorical), the language of stimuli (categorical), the number of correct answers in task A (scale), and the number of correct answers in task B (scale) as factors. In that analysis, each subject’s performance for different sets of stimuli was treated as being independent. However, a friend of mine pointed out that a person’s performance in different sets of stimuli might be somehow related.

Now, what I am trying to do is to include Subject & Set of stimuli as random variables in the log linear analyses. But I couldn’t find a way to do that using the log linear function of SPSS. After a long web search, I guess I might need to use the Mixed models to do it. But it looks like I can have only 1 dependent variable in the Mixed model, instead of including the # of correct answers in both tasks as dependent variables.

So here’s what I am planning to do. I guess I should assign Subject’s language & stimuli language to Fixed factors and use Subject & Set of Stimuli as random factors, select the number of correct answers in Task B as the dependent variable, and use the # of correct answers in Task A as a covariate.

Does this make sense or do you have any suggestion?

Thank you very much!

Eriko

Hi Eriko,

It’s really difficult for me to say what analysis you can use in any given study without asking you many, many questions and having a thorough understanding of the design, despite your very thorough explanation. I’d really have to meet with you in consultation.

I can tell you that indeed, you’ll need to use Mixed, not log linear. And what you need to do is stack the data for mixed to work, so that, as you said, you have only one DV. So each subject has two rows of data–one for each task.
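To make “stacking” concrete, here is a minimal Python sketch of the reshaping, going from one row per subject to one row per subject per task. The column names (`subject`, `task_a`, `task_b`, `correct`) and the scores are invented for illustration; in SPSS the same restructuring is usually done with the VARSTOCASES command.

```python
# Hypothetical wide-format data: one row per subject, one column per task.
wide = [
    {"subject": 1, "task_a": 14, "task_b": 11},
    {"subject": 2, "task_a": 9, "task_b": 12},
]


def stack(rows, value_columns):
    """Return long-format rows: one row per subject per task."""
    stacked_rows = []
    for row in rows:
        for column in value_columns:
            stacked_rows.append(
                {"subject": row["subject"], "task": column, "correct": row[column]}
            )
    return stacked_rows


stacked = stack(wide, ["task_a", "task_b"])
for row in stacked:
    print(row)
```

After stacking, `correct` is the single DV, `task` becomes a predictor, and `subject` identifies which rows belong together.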

Karen

Hi, Karen

If we use a repeated statement for subject and a random effect for a second-level variable, i.e. block, when running this with Proc Mixed, is the equal-variance assumption for subjects the same as in the marginal model? Do the subjects have to have the same number of measurements too?

Hi Irene,

What you’re essentially doing is running a marginal model at that lower level, so yes, everything from a marginal model applies there, but within a block. So the equal variance assumption would only apply within a block.

Thanks,

Karen

Hi Karen,

Thank you very much for starting this website.

In my experiment, in which participants judge different emotional faces on several criteria, I use two measures of empathy because I predict that both will have an influence. The two questionnaire scales are weakly correlated but do not measure exactly the same thing. Can I include both scales in my mixed model ANOVA, or will they sort of average each other out? Should I run two separate models if I want to use both measures of empathy?

MIXED decision BY treatment target WITH empathic_concern perspective_taking

/CRITERIA=CIN(95) MXITER(100) MXSTEP(10) SCORING(1) SINGULAR(0.000000000001) HCONVERGE(0, ABSOLUTE) LCONVERGE(0, ABSOLUTE) PCONVERGE(0.000001, ABSOLUTE)

/FIXED=treatment target empathic_concern perspective_taking

treatment * target perspective_taking

treatment * empathic_concern

treatment * perspective_taking

empathic_concern * perspective_taking

treatment * perspective_taking * empathic_concern

| SSTYPE(3)

/METHOD=ML

/PRINT=CORB DESCRIPTIVES SOLUTION

/RANDOM=INTERCEPT | SUBJECT(id) COVTYPE(VC)

/EMMEANS=TABLES(OVERALL).

Thank you very much for your help.

best wishes,

Mariska

Hi Mariska,

I’m assuming the two variables you’re talking about are the predictors–perspective taking and empathic concern.

If they are weakly correlated, they won’t average each other out at all. They may overlap a small amount in the variance they explain, but just keep this in mind when you’re interpreting their effects. Having mildly correlated predictors in a model is quite common.

Best,

Karen

Hello Karen,

First I want to thank you for this very helpful website.

I am working with mixed models on longitudinal data and would really appreciate it if you could help me with a dilemma regarding a time-variant PREDICTOR. Namely, how do I specify the model to answer two different questions: (1) Is change in DV different for different levels of PREDICTOR, and (2) Does CHANGE in PREDICTOR predict CHANGE in DV?

The scheme of my syntax is

mixed DV by gender with TIME PREDICTOR

/fixed TIME GENDER PREDICTOR TIME*PREDICTOR |sstype(3)

/random intercept TIME | subject (NR) covtype(un)

/print G testcov solution r

/method = reml

/repeated = TIME| subject (NR) covtype(diag).

This syntax (I think) answers the (1) question. How do I model (2) question?

Thank you in advance and kind regards,

Irma

Hi Irma,

I think you’ve got it already with the predictor*time interaction, but I’m not sure.

The way you’re wording it is throwing me off. When you say change in predictor, or change in DV, do you mean a difference in values or do you mean change over time?

Karen

Hi Karen,

Thank you very much for taking time to answer me.

I meant “change over time”. So, my first question is “Do current differences in PREDICTOR predict the slope of DV?” (e.g. does motivation at any time point predict a subsequent increase in school grades?). (Note: I have 5 time points in an accelerated design.) My second question is “Is change (over time) in PREDICTOR related to change (over time) in DV?” (e.g. does an increase in motivation predict an increase in school grades above the “static” prediction of change in school grades from current motivation?)

Is it possible to distinguish these two questions with time-variant predictors? I have hundreds of literature pages on MLM, but I never found SPSS syntax sample on this issue – I would really appreciate if you can help me.

Thank you in advance,

Irma

Hi Irma,

This is such a great question that I had to think about it a while and even look it up, and I verified what I thought initially. The time*motivation term tests #1.
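Written out, one reading of the model in the syntax Irma posted is the following, for subject $i$ at occasion $t$. The symbols are notation assumed here for illustration, not SPSS output:

$$
\mathrm{DV}_{ti} = \beta_0 + \beta_1\,\mathrm{TIME}_{ti} + \beta_2\,\mathrm{GENDER}_i + \beta_3\,\mathrm{PREDICTOR}_{ti} + \beta_4\,(\mathrm{TIME}_{ti} \times \mathrm{PREDICTOR}_{ti}) + u_{0i} + u_{1i}\,\mathrm{TIME}_{ti} + e_{ti}
$$

Here $u_{0i}$ and $u_{1i}$ are the random intercept and random TIME slope from the random statement, and $\beta_4$ is the time*predictor coefficient. A nonzero $\beta_4$ means the slope of the DV over time depends on the value of the predictor, which is question #1.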

To answer question #2, I believe you need a latent growth model. This essentially means using the SEM approach (and software), and it’s the latent growth in the predictor that can then predict the growth in the DV. That’s the part that the regular mixed model isn’t getting at–the growth in the predictor.

I would highly recommend reading Singer & Willett’s “Applied Longitudinal Data Analysis.” There is a section in there on time-varying predictors and a full chapter on latent growth models.

And if anyone else out there reading this knows something I don’t, I welcome your input. 🙂

Karen

Dear Karen,

Thank you so much for your resources; I find them extremely helpful. Now I’m struggling with a situation similar to the one described by Irma Brkovic: I want to test whether the amount of change in X affects the amount of change in Y from time-0 to time-1. X does change significantly between time-0 and time-1. Basically I want to know:

-does Y change between time-0 to time-1?

and, most importantly:

-can I attribute the amount of change in Y, to the amount of change in X between time-0 to time-1?

I wonder if it’s correct to use this syntax below; I have included the baseline measure of X and the amount of change in X as independent variables (X at time-1 = X@t0 + Change_X).

MIXED Y BY time WITH X@t0 Change_X

/FIXED=time X@t0 Change_X | SSTYPE(3)

/METHOD=REML

/PRINT=SOLUTION

/REPEATED=time | SUBJECT(PATID) COVTYPE(AR1).

To make this syntax work I had to recode Change_X from missing to 0, at time-0 because there is no Change_X at time-0.

Or do I need the SEM approach for this? And which software would be most suitable then (Amos, Mplus, Lisrel)?

Kind regards,

Vera

If I first find a significant change in Y:

MIXED Y BY time

/FIXED=time | SSTYPE(3)

/METHOD=REML

/PRINT=SOLUTION

/REPEATED=time | SUBJECT(PATID) COVTYPE(AR1).

and then I add baseline X and Change in X as predictors:

MIXED Y BY time WITH X@t0 Change_X

/FIXED=time X@t0 Change_X | SSTYPE(3)

/METHOD=REML

/PRINT=SOLUTION

/REPEATED=time | SUBJECT(PATID) COVTYPE(AR1).

it turns out that the effect of Change_X is significant, but the effect of time is now insignificant.

Can I state the change of Y, over time, is attributable (mediated) by change in X? And use a Sobel test, too? (time–>X, change_X–>Y)

Thanks in advance,

Vera

Hi Karen,

Thank you a lot for your quick reply.

What I’m really wondering is how I can interpret the covariance estimates when I used RANDOM statement with two or more variables.

For example, I want to see (1) if Y is affected by factors A and B (binary factors), (2) if the factors are related to the variance, and (3) what proportion of that variability is due to each factor. I have three Y measurements for each subject, taken in 3 different weeks.

data set looks like as follows:

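| id | week | Y | A | B |
|----|------|----|---|---|
| 1 | 1 | 10 | 1 | 1 |
| 1 | 2 | 15 | 2 | 1 |
| 1 | 4 | 23 | 2 | 1 |
| 2 | 1 | 20 | 1 | 2 |
| 2 | 2 | 14 | 1 | 1 |
| 2 | 4 | 28 | 2 | 2 |

… and so on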

proc mixed data=mydata ;

class id week A B;

model Y = week A B/ solution ;

repeated week / subject=id type=sp(pow)(week) ;

random A B/ subject=id ;

run;

And I got the following covariance estimates:

| Cov Parm | Subject | Estimate |
|----------|---------|----------|
| A | id | 30 |
| B | id | 5 |
| sp(pow) | id | 0.8 |
| Residual | | 35 |

Then, can I say that the total variability is 30+5+35=70 and the proportion of the variability due to A and B is 30/70 and 5/70, respectively?

That’s the best I can describe my questions. I’m sorry to bother you, but I’ve been stuck on this mixed model for almost a month. 🙁

Thank you and have a good weekend!

Best,

Angela

Hi Angela,

Honestly, it’s difficult for me to suggest how to specify a model without getting my hands a little dirty. I’d need to check a number of things on the output.

But these are some initial impressions by what you’ve posted.

You usually want to include a random intercept if you’re going to put in random slopes for A and B.

Doing this may mean you don’t need the repeated statement at all.

But as I said, I am not comfortable giving advice about specifying models without seeing the data. If you want help, you could always sign up for a Quick Question Consultation (in the consulting menu). That’s what they’re for–when you’re stuck on something.

Karen

Hi Karen,

I’m very interested in the mixed model approach, but I’m struggling a little with running the analyses. The aim of my research is to quantify the variability of measurements and see how much each factor contributes to that variability (all the factors are time-varying). I have measurements at 3 time points.

For the covariance within subject I used a REPEATED statement in PROC MIXED with the sp(pow) covariance type to account for the unequally spaced time intervals, and also used a RANDOM statement to estimate the variability due to several factors. I wonder if this is correct.

In the case where multiple factors are in the RANDOM statement, how can I measure the proportion of factors related to the variability?

Thank you in advance. Your website is always really helpful.

Best,

Angela

Hi Angela,

It’s the random statement and the covariance estimates produced by it that are going to tell you about the proportion of variance explained by any given level. I’m not sure what you mean by multiple factors in the random statement–do you have multiple random slopes? Or do you have multiple random statements because you have, for example, a 3-level model?

It could very well be correct to use a sp(pow) covariance structure for the repeated along with a random statement–it’s hard to tell without seeing it. You really have to compare multiple models.
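For anyone unsure what sp(pow) assumes, here is a small Python sketch of the residual covariance it implies: the covariance between two measurements is the variance times rho raised to the distance in time between them. The weeks (1, 2, 4) follow the data layout Angela posted, and the variance (35) and rho (0.8) are borrowed from her output purely as illustrative numbers. Her model also has random effects, so this is only the repeated-statement piece.

```python
def sp_pow_covariance(times, variance, rho):
    """Covariance matrix with entries variance * rho ** |t_i - t_j|."""
    return [[variance * rho ** abs(ti - tj) for tj in times] for ti in times]


weeks = [1, 2, 4]  # unequally spaced measurement occasions
cov = sp_pow_covariance(weeks, variance=35.0, rho=0.8)

for row in cov:
    print([round(value, 2) for value in row])
```

Measurements one week apart are correlated 0.8, two weeks apart 0.8 squared, and three weeks apart 0.8 cubed. That is how the structure handles unequal spacing, which AR(1) on the occasion number would ignore.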

The basic approach you want is a bottom-up approach to model building. First, with no fixed effects in the model, figure out how the variation is split among the levels (ie. person-level variation, time-level variation). Decide if any random slopes make sense.

Then add the repeated statement to see if you can better account for any correlations among residuals.

Then add fixed effects to the model to see which, if any, variance is explained by those fixed effects.
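The first step above, seeing how the variation splits among levels, comes down to simple arithmetic on the covariance parameter estimates from a model with a random intercept and no fixed effects. A minimal Python sketch, using made-up variance estimates (the names `subject` and `residual` and the numbers are assumptions, not output from any model in this thread):

```python
def variance_shares(components):
    """Return each variance component as a proportion of the total variance."""
    total = sum(components.values())
    return {name: value / total for name, value in components.items()}


# Hypothetical estimates from a random-intercept model with no fixed effects.
empty_model = {"subject": 30.0, "residual": 70.0}
shares = variance_shares(empty_model)
print(shares)
```

The subject share is the intraclass correlation. This simple split is cleanest for random intercepts. Once random slopes are in the model, the variance they contribute depends on the value of the predictor, so adding the estimates together is harder to interpret.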

I don’t want to say it’s a complicated process, but there are many issues to consider, and it’s more than I can do without understanding the variables and design in more detail.

In my Analyzing Repeated Measures Data workshop, the last session is on exactly this–building the model and seeing how to glean the information from it, but it’s only after 7 other sessions of going through what each piece means. For example, we spend one whole session just on the repeated statement and what it does. 🙂

Karen

Hi Karen,

I have only 2 time points in repeated measures. I want to know what factors or covariates at time 1 predict my DV at time 2 (3 months apart). Is the mixed model still appropriate? If so, when I list my covariates are they identified as covariates at time 1? Is my DV identified as the DV at time 2 or as the DV?

Can a random intercept model be used for the above?

Thanks.

Maureen

If you are only interested in the time 2 DV as the outcome, then you don’t need repeated measures. You can treat the time 1 measure of the DV as a covariate.
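As a sketch, with $Y_{i1}$ and $Y_{i2}$ standing for subject $i$’s time 1 and time 2 measures of the DV and $X_{i1}$ for a time 1 covariate (notation assumed here for illustration), that model is an ordinary regression with one outcome per subject:

$$
Y_{i2} = \beta_0 + \beta_1\,Y_{i1} + \beta_2\,X_{i1} + e_i
$$

Because each subject contributes only one outcome value, there is no within-subject correlation left to model.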

What you won’t be able to test though, is the change in the DV over time. If you need that to answer your research question, then you’ll need both the time 1 and time 2 measures as outcomes, and you need some sort of repeated measures–either a repeated measures GLM or a mixed model.

You could run a random intercept (using a random statement) or a marginal model (using a repeated statement). In a case with only two time points, you’ll generally get the same results. What you can’t run with two time points is a random slope model.
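One way to see why the two give the same results: a random intercept implies a compound symmetry covariance matrix for a subject’s observations. A minimal Python sketch with made-up variances (the function name and the numbers are illustrative assumptions):

```python
def random_intercept_covariance(n_times, intercept_var, residual_var):
    """Implied covariance of one subject's observations under a random intercept.

    Every pair of observations shares the intercept variance; each observation
    adds its own residual variance on the diagonal.
    """
    return [
        [intercept_var + (residual_var if i == j else 0.0) for j in range(n_times)]
        for i in range(n_times)
    ]


V = random_intercept_covariance(n_times=2, intercept_var=4.0, residual_var=3.0)
print(V)  # equal variances on the diagonal, one common covariance off it
```

A repeated statement with a compound symmetry structure estimates that same pattern directly, as one common variance and one common covariance. The two can differ when the covariance among a subject’s observations is negative, since a random intercept variance cannot be negative.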

Karen

Your tutorials on mixed models are quite helpful! I am still a bit unclear about when it is appropriate to use the random statement, repeated statement, or both. Your explanation above helped me figure it out to a point, but I think I really need some examples to understand it fully. Are you aware of any resources that provide examples of a study design for each scenario? I would like to learn more about this topic but I am unsure where to start. Thanks!

Hi Lisa,

It’s because there isn’t always a right answer, and you can get the same answer from both in some situations. And in others, the best answer is to include both a random and a repeated statement.

It’s about which gives you the information you need, which gives the best fit and is most theoretically appropriate.

I would start with this article: http://www2.sas.com/proceedings/sugi25/25/aa/25p020.pdf. Another resource that I know talks about it is SAS System for Mixed Models. It’s a very thorough book.

But I have to say, I focus a LOT on this issue in my Analyzing Repeated Measures Data Workshop because I couldn’t find a good resource on it. It has taken me a long time to piece it together. So that is honestly my best suggestion.

We spend hours on this topic in the workshop, using 5 different designs, which is why it’s so hard to explain in an article! 🙂 Anyway, if you want more info, that workshop is being offered again in April. You can get more info here: http://www.theanalysisinstitute.com/workshops/Repeated-Measures/index.html.

Karen

Hi Karen,

I’m very interested in mixed models and I’ve been attending all your webinars on this subject. However, I’m still not sure if I can use mixed models in my research. My dependent variable is Ykijt, i.e. the investment made by investor k of country i in country j in year t. I want to check differences in Ykijt across different types of investors k. I’m also using covariates at the level Xit, Xjt and Xij, and I want to check differences in the influence of those covariates across different types of investors k. I also need to control for factors i, j and t, right? Do you think I can use a mixed model, with k as a fixed factor and i, j and t as random factors? Thanks a lot for all your help,

Best Regards,

Vanda

Hi Vanda,

Based on the way you described it, indeed it sounds like a mixed model. Assuming you have many investors from many countries i, you would make both investor and country of origin random.

It’s hard to say what should be random or fixed without knowing the exact data structure and research questions.

Karen

Hi Karen,

Following up on your comment regarding the limitation of no random slope models with two time points (there is also a reference to that in your “Random Intercept-Random Slope” webinar), I am a bit confused.

If there are only two time points (or any 2-group categorical independent variable, e.g. gender), my impression is that there could still be a random slope model. In other words, the slope representing the difference between Time 1 and Time 2 (or between Group A and Group B of the categorical variable) could still differ and be random among the clusters. However, as far as I understand, if the variance in slope is accounted for in the model (in the random statement), a different slope would be fitted within each cluster. This would result in no residuals (i.e., a perfect fit, since there are only two Times or Groups), but it would still incorporate a random slope in the model.

Does this make sense or do I have it all wrong ?

Thanks !

Hi Michael,

No, you have it right. The problem with the random slope model in that situation is that it’s confounded with the residual. All models will automatically fit a residual variance, so you overspecify if you try to fit a random slope as well.

That said, you CAN fit a random slope if there is more than one observation per subject at each time point (or at each value of the categorical variable). In most repeated measures designs, this isn’t true. There is only one measurement per subject per time point.

However, in a clustered data situation (ie. the clusters are across individuals, not repeated measurements across time), it’s more common.

It can occur in repeated measures, too. It’s just less common.

Karen

Thank you a lot Karen

In order to make it crystal clear, one can technically fit a random slope model with a 2-group categorical variable (or two time points with one observation at each time point), but that would be considered incorrect because of the confounding issue with the residual variance and the overspecification that you mentioned. Is that right ?

Hi Michael,

It won’t even run. You’d be trying to estimate the same parameters two different ways.
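To see the confounding in numbers, here is a small Python sketch. It assumes one observation per subject at each of two time points, time coded 0 and 1, and an unstructured covariance for the random intercept and slope. The parameter names and values are made up for illustration.

```python
def implied_covariance(g00, g01, g11, residual_var):
    """Implied 2x2 covariance of a subject's two observations.

    g00 and g11 are the random intercept and slope variances, g01 their
    covariance. With time coded 0 and 1, V = Z G Z' + residual_var * I.
    """
    v11 = g00 + residual_var
    v12 = g00 + g01
    v22 = g00 + 2 * g01 + g11 + residual_var
    return [[v11, v12], [v12, v22]]


set_a = implied_covariance(g00=4.0, g01=0.0, g11=2.0, residual_var=1.0)
set_b = implied_covariance(g00=3.5, g01=0.5, g11=1.0, residual_var=1.5)

print(set_a)
print(set_b)
print(set_a == set_b)
```

Two different sets of four parameters reproduce exactly the same two variances and one covariance, so the data cannot tell them apart. There are only three quantities to estimate from and four parameters being asked for.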

Hi Helen,

Yes, there is a way.

First, I’m assuming you’re making time categorical to get the EMMEANS.

MIXED Response BY Time Trt

/Fixed Time Trt Time*Trt

/Method REML

/EMMEANS=Tables(Time*Trt) Compare(Time)

/REPEATED=Time |Subject(SubID) Covtype(CS).

So this EMMeans statement will give you a confidence interval for the difference between the times (eg. Time 2 – Time 1) separately for each group. It won’t give you the difference in the differences.

However, that is exactly what the parameter estimates for the regression coefficients tell you–the difference in the differences. It prints out a confidence interval by default.

Just add:

/PRINT=SOLUTION

You’ll have to make sure that Time 1 is the reference group in the dummy codes. This isn’t the default in SPSS Mixed: it makes the highest value the reference group. (You can change this in some regression procedures, but not in Mixed.) The easiest way to do it is to recode time 1 to some value higher than 3.

So you’ll have two interaction terms in the Solutions for Fixed Effects table (these are the regression coefficients). Those will give you the difference between groups in the differences from time 2 to 1 and time 3 to 1.
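Here is the arithmetic behind those interaction coefficients as a short Python sketch. It assumes Time 1 and the control group are the reference categories and that the model includes the full Time*Trt interaction, so the coefficients are contrasts among the six estimated cell means. The group labels and means are invented for illustration.

```python
# Hypothetical estimated cell means, keyed by (group, time).
cell_means = {
    ("control", 1): 50.0, ("control", 2): 52.0, ("control", 3): 53.0,
    ("treated", 1): 51.0, ("treated", 2): 58.0, ("treated", 3): 61.0,
}


def change_from_baseline(group, time, means, baseline=1):
    """Change in one group's mean from the baseline time to the given time."""
    return means[(group, time)] - means[(group, baseline)]


def difference_in_differences(time, means, baseline=1):
    """Treated group's change minus control group's change."""
    return (change_from_baseline("treated", time, means, baseline)
            - change_from_baseline("control", time, means, baseline))


did_time2 = difference_in_differences(2, cell_means)
did_time3 = difference_in_differences(3, cell_means)
print(did_time2, did_time3)
```

With those reference categories, each Time*Trt coefficient equals one of these differences in differences, and its confidence interval in the fixed effects table is the interval being asked for.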

Karen

Hi Karen

Thanks for your excellent resources; I found them very useful. Regarding your response to Helen above, you did not specify a random effect. Is there any specific reason for that?

Many thanks

Tony

Hi Karen

I have conducted a mixed model analysis on data from an RCT (intervention and control group) with a continuous DV assessed at 3 time points. I have generated the estimated marginal means, but want to report the effect size in terms of a confidence interval for the mean difference of the change in the DV between groups at Time 2, and separately at Time 3, relative to T1 (baseline). Is there any syntax to do this? If not, any suggestions as to what output from this analysis could be used?

Hi Karen,

Firstly, can I congratulate you on your website. Up until now, Julie Pallant’s SPSS Survival Manual had been my best friend. Your webinars/articles have explained many more topics not covered in that book in a similarly understandable way. I just wish I’d stumbled across your resources sooner.

In trying to get my head around linear mixed models I’ve watched 3 of your webinars (Running Repeated Measures as a Mixed Model, Rand-Int-Rand-Slope, & the Fixed vs Random factors in mixed models). Given how much help I’ve already received I hesitate to ask for more.

Also this question is a little tangential to the topic above, but I thought other readers might share an interest in the answer.

Whilst I understand how mixed models do a great job of controlling for Subject with repeated-measures, I’m still unclear on the role of Covariates. Recently I’ve read several reports of clinical trials where in the ‘Statistical analysis’ section of the paper, the authors describe inclusion of various covariates in the LMM (eg: age, gender, the baseline value of the DV, change in heart rate, etc).

I appreciate that you can never know exactly what authors have done, but I was just wondering how these common covariates might typically be included in a simple model assessing the efficacy of a treatment intervention (eg: 2 measurement time points pre & post; 2 groups) in SPSS Mixed. For time-invariant factors (eg: age and gender), the motivation is presumably to adjust for potential group differences in age/gender at baseline (at least that’s my interpretation). Would this involve including the interaction of group*age and group*gender as fixed factors? Should ‘age’ and ‘gender’ also be included as fixed factors alone or just the interaction?

As for time-varying covariates (eg: change in resting heart rate [HR] between baseline and follow-up), would this simply involve inclusion of 1 additional fixed factor (group*time*HR) in the model?

Finally, how would you go about including the baseline value of the DV as a covariate? Is this already done ‘automatically’ by including the intercept as a random factor?

Lastly, the obvious question is how to interpret whether the treatment is still effective after inclusion of the covariates. Is this still solely based on a significant group*time interaction? (ie: even if some of the covariates are significant fixed effects?)

My understanding is that a random slope cannot also be included due to only having 2 time points.

Many thanks again for all your work.

Hi Julian,

First, thanks for the kind words. I’m glad you’re finding my resources helpful.

There are a lot of questions here, and the answers are a bit involved. I’ll do my best to answer them concisely here.

Yes, mixed models control for subject by including it as a random factor, but they can also control for both time-varying and time-invariant covariates by including them as fixed. Baseline demographics and values of the DV are common covariates. Controlling for them in this way gives you means of the DV over time adjusted for where people started out.

Both types of covariates can interact with group or time. The former interaction says that the effect of the covariate differs in the two groups. The latter says the average change in the DV over time differs across values of the covariate. Any interaction with time describes differences in the growth. I use growth generically–the DV doesn’t have to go up over time.

To test whether the treatment is still effective after inclusion of covariates (ie. intervention group has a higher mean than control), include covariates in the fixed statement, then add group.

If you want to test if the *growth* over time in the treatment group is higher than the control, include a group*time interaction.

Yes, you’re limited with only two time points–no random slope models and if you include the baseline DV as a covariate, you can’t use it as an outcome. So you’ll have only one time point as an outcome value. So no random intercept either. 🙂

Karen

Thanks Karen – I hadn’t intended to ask so many questions when I began writing the comment 🙂 I appreciate you taking the time to reply.

Hi there,

This is a great resource!

I’m using mixed models for a slightly different purpose — I have a dataset with a number of identical twin pairs and fraternal twin pairs. I want to examine the relationship between two variables (let’s call them INDEPENDENT and DEPENDENT). However, I can’t run a normal OLS regression because each twin’s dependent variable is correlated with their co-twin’s.

The way I have been dealing with this is to use SAS PROC MIXED and include a random intercept defined by twin pair (FAMILYID). Here is my syntax:

proc mixed method=ml covtest noclprint;

class FAMILYID;

model DEPENDENT = INDEPENDENT/solution;

random intercept/sub=FAMILYID type=un gcorr;

run;

However, I’ve realized that I have a heteroskedasticity problem. The identical twins are likely to be more related to each other than the fraternal twins (variable indicating whether twins are fraternal or identical is called TWINTYPE), and the model doesn’t reflect this.

According to SAS documentation for the RANDOM statement: “GRP=effect defines an effect specifying heterogeneity in the covariance structure of G. All observations having the same level of the group effect have the same covariance parameters.”

So, now I’ve got this:

proc mixed method=ml covtest noclprint;

class FAMILYID;

model DEPENDENT = INDEPENDENT/solution;

random intercept/sub=FAMILYID group=TWINTYPE type=un gcorr;

run;

In researching this, I started reading more about the REPEATED statement and wondered if:

proc mixed method=ml covtest noclprint;

class FAMILYID;

model DEPENDENT = INDEPENDENT/solution;

random intercept/sub=FAMILYID group=TWINTYPE type=un gcorr;

repeated / subject=IDYRFAM group=TWINTYPE type=un r rcorr;

run;

would actually be the way to go? Would I be doing anything redundant by adding the REPEATED statement?

Hi Julia,

It’s hard to say as I’m not sure whether that subject variable in the repeated statement is the same as in the random. Let’s assume they are.

If that’s the case, my expectation is it won’t run. That is of course an empirical question.

Your second model sounds correct to me and you could replace the random statement with this repeated statement to get the same results:

repeated twintype/subject=familyid type=un r rcorr;

The un in a repeated statement will give a separate variance estimate for each twin type, just as group=twintype did in the random statement.

Hi Vera,

You’ve got a lot going on there. I’d only be able to advise if I talked with you (and asked many questions). Mediation in repeated measures studies is tricky. I would start here: Judd, McClelland and Kenny (2001) “Estimating and Testing Mediation and Moderation in Within-Subject Designs.” Psychological Methods (Vol. 6, No. 2, 115-134).

My impression is a lot of journal editors no longer accept Sobel test results as the normality assumption of the indirect effect is difficult to meet.