Generalized linear models, linear mixed models, generalized linear mixed models, marginal models, GEE models. You’ve probably heard of more than one of them and you’ve probably also heard that each one is an extension of our old friend, the general linear model.

This is true, and they extend our old friend in different ways, particularly in regard to the measurement level of the dependent variable and the independence of the measurements. So while the names are similar (and confusing), the distinctions are important.

It’s important to note here that I am glossing over many, many details in order to give you a basic overview of some important distinctions. These are complicated models, but I hope this overview gives you a starting place from which to explore more.

**General Linear Models**

The general linear model has this basic form:

Y = β0 + β1X1 + β2X2 + … + βkXk + ε

And it has these assumptions (among others):

- the residuals are independent of each other
- the residuals are normally distributed
- the relationship between Y and the model parameters is linear
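
For concreteness, here is how such a model could be fit in R. The data frame and variable names (dat, y, x1, x2) are hypothetical placeholders.

```r
# Fit a general linear model by ordinary least squares.
# 'dat', 'y', 'x1', and 'x2' are hypothetical placeholder names.
fit_lm <- lm(y ~ x1 + x2, data = dat)

summary(fit_lm)           # coefficient estimates and standard errors
plot(residuals(fit_lm))   # check the residuals against the assumptions above
```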

So let’s see how some of the different model types extend this model in different ways.

**Generalized Linear Models**

Generalized linear models relax the last two assumptions. They generalize the possible distributions of the residuals to a family of distributions called the exponential family. This family includes the normal as well as the binomial, Poisson, negative binomial, and gamma distributions, among others. You are probably familiar with common examples like logistic, Poisson, and probit models.

When you change the distribution of the residuals, it turns out that the relationship between Y and the model parameters is no longer linear. However, for each distribution in the exponential family, there exists at least one function of the mean of Y whose relationship with the model parameters is linear. This function is called the link function.

The link function you choose depends on which distribution you choose for the outcome variable. For example, a binomial residual can use a probit or a logit link function, and a Poisson residual uses a log link function.
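
In R, these distribution-and-link choices could look like the sketch below. The data frame and variable names (dat, y_binary, y_count, x) are made up for illustration.

```r
# Hypothetical generalized linear models in R.
# 'dat', 'y_binary', 'y_count', and 'x' are placeholder names.

# Logistic regression: binomial distribution with a logit link
fit_logit  <- glm(y_binary ~ x, family = binomial(link = "logit"),  data = dat)

# Probit regression: binomial distribution with a probit link
fit_probit <- glm(y_binary ~ x, family = binomial(link = "probit"), data = dat)

# Poisson regression: Poisson distribution with a log link
fit_pois   <- glm(y_count ~ x,  family = poisson(link = "log"),     data = dat)
```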

**Marginal Models**

Marginal models are a type of linear model that accounts for repeated response measures on the same subject. They extend the general linear model by allowing and accounting for non-independence among the observations of a single subject.

They do this by estimating one or more parameters that capture the covariance among the residuals. So rather than having a single constant variance and zero covariance for all residuals, observations from the same subject are allowed to have different variances and nonzero covariances. The pattern of variances and covariances is known as the covariance structure of the R matrix.

They still assume that observations from different subjects are independent, and linear marginal models still assume residuals are normally distributed.
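
One way to fit a linear marginal model in R is with generalized least squares from the nlme package, as in this sketch. The variable names (dat, y, time, treatment, subject) are hypothetical, and compound symmetry is just one choice of covariance structure.

```r
# A sketch of a linear marginal model using generalized least squares (nlme).
# 'dat', 'y', 'time', 'treatment', and 'subject' are placeholder names.
library(nlme)

fit_marginal <- gls(
  y ~ time + treatment,
  correlation = corCompSymm(form = ~ 1 | subject),  # exchangeable (compound symmetry) R matrix
  data = dat
)

# Other covariance structures are available, e.g. corAR1() for
# first-order autoregressive correlations across time points.
```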

**GEE Models**

Generalized estimating equation (GEE) models are generalized linear marginal models. That is, they combine the generalized linear model for a non-normal residual with the repeated measures of a marginal model. You would use these when you have repeated measures on each subject and need to run a logistic, multinomial, Poisson, or other generalized linear regression model.
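
For instance, a GEE for a binary repeated-measures outcome could be fit with the geepack package in R along these lines; the data frame and variable names are hypothetical.

```r
# A hypothetical GEE model for a binary repeated-measures outcome (geepack).
# 'dat', 'y_binary', 'time', 'treatment', and 'subject' are placeholder names.
library(geepack)

fit_gee <- geeglm(
  y_binary ~ time + treatment,
  id = subject,              # identifies the repeated measures belonging to each subject
  family = binomial,
  corstr = "exchangeable",   # working correlation structure among a subject's residuals
  data = dat
)

summary(fit_gee)  # coefficients with robust (sandwich) standard errors
```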

**Linear Mixed Models**

Like marginal models, linear mixed models account for non-independence among clustered observations, but they do it in a different way.

Instead of estimating nonzero correlations among residuals, linear mixed models account for the fact that clustered observations are similar by estimating the variance among cluster means and the variance among observations within a cluster. They literally partition the variance in Y into cluster-level and observation-level parts.

Because of the way they account for variation among subjects, linear mixed models are much more flexible than marginal models.

For example, they can accommodate three (or more) levels of clustering, such as repeated measurements on patients who are themselves clustered within hospitals, and they can estimate subject-specific effects, such as random slopes, that go beyond variation among subject means.

They can accomplish these feats because they include parameters to measure the random effects of the clusters: they treat the variation among clusters as another source of residual variation. The "Mixed" in the name comes from the fact that they estimate both fixed and random effects.

Like all linear models, linear mixed models assume residuals are normally distributed and the relationship between Y and the model parameters is linear.
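
Here is a sketch of what random intercepts, random slopes, and a three-level design might look like with the lme4 package in R. The data frame and variable names (dat, y, time, patient, hospital) are hypothetical.

```r
# Hypothetical linear mixed models fit with lme4.
# 'dat', 'y', 'time', 'patient', and 'hospital' are placeholder names.
library(lme4)

# Random intercept: each patient gets their own deviation from the overall mean
fit_ri <- lmer(y ~ time + (1 | patient), data = dat)

# Random intercept and random slope: patients also differ in their time trends
fit_rs <- lmer(y ~ time + (time | patient), data = dat)

# Three levels: repeated measurements on patients nested within hospitals
fit_3lvl <- lmer(y ~ time + (1 | hospital/patient), data = dat)

summary(fit_ri)  # fixed effects plus variance components for patient and residual
```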

**Generalized Linear Mixed Models**

You probably know by now where this one is going.

Generalized linear mixed models are mixed models in which the residuals follow a distribution from the exponential family. They use the same link functions as generalized linear models *and* include at least one random effect.
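
A minimal sketch of a GLMM in R, using lme4's glmer with made-up variable names, might look like this: a logistic regression with a random intercept for each subject.

```r
# A hypothetical generalized linear mixed model: logistic regression with a
# random intercept per subject (lme4). 'dat', 'y_binary', 'time', 'treatment',
# and 'subject' are placeholder names.
library(lme4)

fit_glmm <- glmer(
  y_binary ~ time + treatment + (1 | subject),
  family = binomial,
  data = dat
)

summary(fit_glmm)  # fit by approximate maximum likelihood (Laplace by default)
```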

Both generalized linear models and linear mixed models can be computationally intensive, and mixed models become more so as the number of random effects to be estimated goes beyond one or two. Putting them together can be especially demanding. I've run GLMMs that took hours to run on not-very-large data sets. They require special care and should not be undertaken lightly.

**How all the models are the same**

I’ve focused on how these models differ, but they also have underlying similarities.

- The structure is the same: they are all models of the relationship between a single response variable Y and one or more predictor variables X. The variation around the model is estimated in the residual.
- They generally all use some form of maximum likelihood estimation. Even OLS estimation, used in the general linear model, is a special case of maximum likelihood.
- Fixed effects work the same in all these models. The function of Y may differ, and the residual structure may differ, but the X variables work the same in every one of these models. Dummy and effect coding, continuous predictors, interactions, and quadratic terms have the same inherent meaning and can be used in any of these models.
- The General Linear Model is a subset of each of these other models. You could, if you really wanted to, run a general linear model in a software procedure designed for any of these other models by choosing the right options (see the sketch below). The reverse is not true.
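
As one small illustration of that last point, the following R sketch (with hypothetical data and variable names) fits the same general linear model with lm() and with glm() by choosing a normal distribution and an identity link, and the coefficients match.

```r
# The general linear model as a special case of the generalized linear model.
# 'dat', 'y', and 'x' are hypothetical placeholder names.
fit_ols <- lm(y ~ x, data = dat)
fit_glm <- glm(y ~ x, family = gaussian(link = "identity"), data = dat)

all.equal(coef(fit_ols), coef(fit_glm))  # TRUE (up to numerical precision)
```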

If you want to learn more about linear mixed models, check out the recording of my Random Intercept and Random Slope Models webinar. These two models are the basic building blocks of all mixed models.

**Get it all here**. It’s free.

———–


I have case-control data with non-parametric noise. I have been able to find a signal using GEE, denoting each subject as a cluster (as is standard, since GEE is built on marginal models). Conditional logistic regression appears to be standard for case-control studies, but I couldn't find a clogit approach that used population averaging.

Could it be okay to cluster on the case-control groupings in a GEE model even though it is built on a marginal model? This would increase within-cluster correlation and decrease between-cluster correlation (following the assumptions of "cluster data" as described in the paper for the R geepack package's geeglm function).

Dear Karen,

Thanks a lot for this very helpful website!

At the moment, I am looking for an adequate model for some repeated-measures data and think that GEE or GENLINMIXED would be the best fit.

My dependent variable is a 0/1 dummy variable called "attribute" (y/n), and there are some independent variables such as gender, profession (a nominal variable with 4 possible values), age (at the moment of the interview), and another dummy variable focussing on prizes (y/n) that are awarded every year.

The repeated measurements took place various times a year (or award period, if you like) but are unfortunately not identically distributed within the years.

Now I thought of SPSS syntax like this:

```
GENLIN attitude (REFERENCE=FIRST) BY profession gender award award_period WITH age_in_days
  /MODEL profession gender age_in_days award INTERCEPT=YES DISTRIBUTION=BINOMIAL LINK=LOGIT
  /REPEATED SUBJECT=respondent_ID WITHINSUBJECT=interview_number.
```

Interview number uniquely identifies all interviews (repeated measurements), and respondent_ID uniquely identifies all interviewees. The data are in long format, so each respondent × measurement combination is one case.

But I am not sure whether this is really the best way to start. By the way, we assume that there might be some profession- and year-specific "clustering" effects.

I am looking forward to any helpful comment!

All the best,

Pauline

Priceless! Thanx for these clear explanations. This blog rules!
