One of the most confusing things about mixed models arises from the way they're coded in most statistical software. Of the packages I've used, only HLM sets it up differently, so this doesn't apply to it.
But the rest of them (SPSS, SAS, R's lme and lmer, and Stata) all require the same pieces of information in their basic syntax:
1. The dependent variable
2. The predictor variables for which to calculate fixed effects and whether those are categorical or continuous. Each software has a different way of specifying them, but they all need to know that.
3. The predictor variables for which to calculate random effects, the level at which to calculate those effects, and if there are multiple random effects, the covariance structure of those effects.
The confusion comes in when the same predictor appears in both the fixed and random parts. The syntax makes it look like we're treating that one predictor as both a fixed and a random factor.
But we’re not. It’s not only okay, it’s often the only way to write the model appropriately.
Let’s take a very simple example. This is the same model I use in my free webinar Random Intercept and Random Slope Models. If you haven’t seen it and want more detail, you can get the recording here.
The basic idea, though, is we’re comparing the economic growth over 5 decades between Rural and Metropolitan counties.
Economic growth is the outcome, measured in thousands of jobs (JobsK). JobsK is continuous.
County indicates from which county the observations come. Each county has up to 5 measurements, and this is why we need the mixed model—to account for the inherent correlation among the multiple observations from the same county. County is categorical.
Time indicates the number of decades since 1960 and ranges from 0 to 4. Time is treated as continuous.
And Rural is an indicator (aka dummy) variable for whether the county is rural. Rural is categorical.
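To make the data layout concrete, here is a minimal sketch of what "long" format looks like for these four variables. The county names and every JobsK value below are made up purely for illustration; they are not from the actual data set.

```python
from collections import Counter

# Made-up rows in "long" format: one row per county per decade.
# County names and JobsK values are invented purely for illustration.
rows = [
    {"County": "A", "Rural": 1, "Time": 0, "JobsK": 4.1},
    {"County": "A", "Rural": 1, "Time": 1, "JobsK": 4.6},
    {"County": "A", "Rural": 1, "Time": 2, "JobsK": 5.0},
    {"County": "B", "Rural": 0, "Time": 0, "JobsK": 52.3},
    {"County": "B", "Rural": 0, "Time": 1, "JobsK": 60.8},
    {"County": "B", "Rural": 0, "Time": 2, "JobsK": 71.5},
    {"County": "B", "Rural": 0, "Time": 3, "JobsK": 80.2},
]

# Number of repeated measurements per county (up to 5 in the example above).
obs_per_county = Counter(row["County"] for row in rows)

# Rural is constant within a county; Time and JobsK change from row to row.
rural_by_county = {row["County"]: row["Rural"] for row in rows}
```

The repeated rows per county are exactly what creates the correlation the mixed model has to account for.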
SPSS:
MIXED JobsK BY Rural WITH Time
  /FIXED=Rural Time Rural*Time
  /RANDOM=Intercept Time | SUBJECT(County) COVTYPE(UN).

SAS:
proc mixed data=countylong;
class rural county;
model JobsK = rural time rural*time;
random int time / subject=county type=un;

R (lme):
model <- lme(JobsK ~ rural*time, random = ~time | County, data = countylong, na.action = na.omit)

Stata:
mixed JobsK c.Time##Rural || County: Time, variance reml cov(un)
You can see here that Time is listed in the fixed portion of the model, which appears in SPSS's Fixed statement, SAS's model statement, before the || in Stata, and in the model formula before the first comma in R.
And it's also listed in the random portion, which appears in SPSS's and SAS's Random statement, after the || in Stata, and in the random argument after the first comma in R.
It looks like we’re treating Time as both fixed and random. If we’re not, then what the heck are we doing?
The fixed portion is doing exactly what a linear model does. It fits an overall regression line over time. Since we have both Rural and a Rural*Time interaction, it actually fits two regression lines—one for the rural counties and one for the metropolitan counties. The coefficient we get for Rural measures the difference in their intercepts and the coefficient for the interaction measures the difference in their slopes.
Just to emphasize: This fixed effect for time measures the overall effect for time across all counties. It’s often called the population average effect, because it’s an estimate of the effect of time for the population of all counties.
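As a small arithmetic sketch of how the four fixed coefficients turn into two regression lines: the function name and the coefficient values below are hypothetical, and I'm assuming Rural is coded 1 for rural and 0 for metropolitan counties (the reference category can differ by software).

```python
def group_lines(b0, b_rural, b_time, b_inter):
    """Return (intercept, slope) for each group from the fixed coefficients.

    Assumes Rural is dummy coded: 0 = metropolitan (reference), 1 = rural.
    """
    metro = (b0, b_time)                      # Rural = 0: interaction terms drop out
    rural = (b0 + b_rural, b_time + b_inter)  # Rural = 1: shift intercept and slope
    return {"metro": metro, "rural": rural}


# Hypothetical coefficient values, chosen only to make the arithmetic easy.
lines = group_lines(b0=50.0, b_rural=-45.0, b_time=8.0, b_inter=-7.0)
```

With these made-up numbers, the metropolitan line has intercept 50 and slope 8, while the rural line has intercept 5 and slope 1. The Rural coefficient is the gap between the intercepts and the interaction coefficient is the gap between the slopes.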
Okay, so what is that random effect of time? Aren’t we making Time random as well as fixed?
As I said earlier, no.
A key part of the random statement is the identification of the Subject. In this example, it’s County. It’s really County that is a random factor in the model and we’re specifying two random effects for those Counties—an intercept and a slope over Time.
The random slope for Time at the County level means that the slope across time varies across Counties. In other words, the effect of Time on Jobs (the slope) is different for different values of County.
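It can help to see the whole model written out as one equation. The notation here is my own choice, not something the software prints: i indexes the measurement within county j, the betas are the fixed effects, and u_0j and u_1j are county j's random intercept and random slope.

```latex
\begin{aligned}
\text{JobsK}_{ij} &= \beta_0 + \beta_1\,\text{Rural}_j + \beta_2\,\text{Time}_{ij}
                   + \beta_3\,\text{Rural}_j \times \text{Time}_{ij} \\
                  &\quad + u_{0j} + u_{1j}\,\text{Time}_{ij} + \varepsilon_{ij}, \\
\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix}
  &\sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},
     \begin{pmatrix} \tau_{00} & \tau_{01} \\ \tau_{01} & \tau_{11} \end{pmatrix} \right),
  \qquad \varepsilon_{ij} \sim N(0, \sigma^2).
\end{aligned}
```

Time shows up twice, once multiplied by the fixed coefficients and once multiplied by u_1j. County j's own slope is beta_2 + beta_3 Rural_j + u_1j, and tau_11 is the variance of the random slopes. The unstructured 2-by-2 covariance matrix is what the UN option requests.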
If you are thinking that it sounds like we’re really fitting an interaction between Time and County, then you would be correct. We are.
Because this slope is a random effect, we don’t measure this interaction through a regression coefficient as we would if it were fixed.
Instead, we measure how much each County's slope differs from the population average slope for its type of county (rural or metropolitan), then find the variance of these difference measures. That's the variance estimate for the random slope.
If that variance comes out to 0, it indicates that, apart from the rural versus metropolitan difference captured by the fixed effects, the slope of Time on Jobs is actually the same for all counties. They don't vary from each other.
Now of course, we’re not doing these steps directly. But that is basically what the model is doing, through a lot of complicated statistical algorithms.
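To make those conceptual steps concrete, here is a rough simulation sketch using only Python's standard library. Everything in it is an assumption for illustration: the parameter values are invented, there is a single group of counties (no Rural term) to keep it short, and fitting a separate least-squares line per county is a stand-in for what the mixed model does, not the actual estimation method.

```python
import random
import statistics

random.seed(1)

# Hypothetical population values, invented for illustration only.
POP_INTERCEPT, POP_SLOPE = 10.0, 2.0
SLOPE_SD = 1.5            # true SD of county slopes around the average slope
TIMES = [0, 1, 2, 3, 4]   # decades since 1960


def ols_slope(x, y):
    """Least-squares slope of y on x for one county."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


county_slopes = []
for _ in range(200):
    u0 = random.gauss(0, 3.0)       # this county's intercept deviation
    u1 = random.gauss(0, SLOPE_SD)  # this county's slope deviation
    jobs = [POP_INTERCEPT + u0 + (POP_SLOPE + u1) * t + random.gauss(0, 1.0)
            for t in TIMES]
    county_slopes.append(ols_slope(TIMES, jobs))

# Step 1: how far is each county's slope from the average slope?
avg_slope = statistics.mean(county_slopes)
deviations = [s - avg_slope for s in county_slopes]

# Step 2: the variance of those deviations.
slope_variance = statistics.variance(deviations)
```

One caveat about this shortcut: each county's fitted slope also carries some residual noise, so the variance computed this way runs a bit larger than the true slope variance (SLOPE_SD squared). Separating those two sources of variation is part of what the mixed model's estimation handles for us.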
So, to reiterate the central point: Time in the fixed statement measures the overall effect of time on jobs across all counties. Time in the random statement measures the variance in the effects of time on jobs across counties. It looks the same in the syntax, but it’s actually a very different concept.