Generalized linear models, linear mixed models, generalized linear mixed models, marginal models, GEE models. You’ve probably heard of more than one of them and you’ve probably also heard that each one is an extension of our old friend, the general linear model.

This is true, and they extend our old friend in different ways, particularly in regard to the measurement level of the dependent variable and the independence of the measurements. So while the names are similar (and confusing), the distinctions are important.

It’s important to note here that I am glossing over many, many details in order to give you a basic overview of some important distinctions. These are complicated models, but I hope this overview gives you a starting place from which to explore more.

**General Linear Models**

The general linear model has this basic form:

$$Y_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon_i$$

$$\varepsilon_i \sim \text{iid } N(0, \sigma^2)$$

It has these assumptions (among others):

- the residuals are independent of each other
- the residuals are normally distributed
- the relationship between Y and the model parameters is linear
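
To make the form above concrete, here is a minimal sketch in plain Python that simulates data from a general linear model with iid normal residuals and recovers the coefficients by ordinary least squares. The coefficient values, sample size, and the helper names `solve` and `ols` are all made up for illustration; in practice you would use your statistical software's regression procedure.

```python
import random


def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y.

    Each row of X must start with a 1 for the intercept.
    """
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical "true" coefficients: beta_0, beta_1, beta_2.
true_beta = [1.0, 2.0, -0.5]
rng = random.Random(0)

# Y_i = beta_0 + beta_1*X_1 + beta_2*X_2 + e_i, with e_i ~ iid N(0, 1).
X = [[1.0, rng.uniform(0, 10), rng.uniform(0, 10)] for _ in range(500)]
y = [sum(b * x for b, x in zip(true_beta, row)) + rng.gauss(0, 1.0) for row in X]

beta_hat = ols(X, y)
print("estimated coefficients:", [round(b, 3) for b in beta_hat])
```

The estimates should land close to the values used to simulate the data, because the simulated residuals meet all three assumptions listed above.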

So let’s see how some of the different model types extend this model in different ways.

**Generalized linear models**

Generalized linear models relax the last two assumptions. They generalize the possible distributions of the residuals (more precisely, of Y given X) to a family of distributions called the exponential family. This family includes the normal as well as the binomial, Poisson, beta, and gamma distributions, among others.

You are probably familiar with common examples like logistic, Poisson, and probit models.

When you change the distribution of Y|X, it turns out that the relationship between Y and the model parameters is no longer linear. However, for each distribution in the exponential family, there exists at least one function of the mean of Y whose relationship with the model parameters is linear. This function is called the link function.

$$f(\mu_{Y|X}) = \beta_0 + \beta_1 X_1 + \beta_2 X_2$$

The link function you choose will depend on which conditional distribution you are choosing for the outcome variable. For example, a binomial distributed variable can use a probit or a logit link function. A Poisson distributed variable usually uses a log link function.
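
As a rough illustration of what a link function does, the sketch below defines the logit and log links and their inverses in plain Python. The coefficients `beta0` and `beta1` are hypothetical values chosen only to show the mapping. The linear predictor can take any real value, while the inverse link maps it back to a mean in the valid range for the distribution.

```python
import math


def logit(mu):
    """Logit link: maps a probability in (0, 1) to the whole real line."""
    return math.log(mu / (1.0 - mu))


def inv_logit(eta):
    """Inverse logit: maps any linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def log_link(mu):
    """Log link: maps a positive mean (e.g. a Poisson rate) to the real line."""
    return math.log(mu)


def inv_log(eta):
    """Inverse log link: maps any linear predictor back to a positive mean."""
    return math.exp(eta)


# Hypothetical coefficients, for illustration only.
beta0, beta1 = -1.0, 0.8

for x in (-5.0, 0.0, 5.0):
    eta = beta0 + beta1 * x  # linear in the parameters
    print(
        f"x={x:5.1f}  eta={eta:6.2f}  "
        f"binomial mean={inv_logit(eta):.4f}  Poisson mean={inv_log(eta):.4f}"
    )
```

Notice that the model is linear on the scale of the link function, not on the scale of the mean itself.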

**Marginal Models**

Marginal models are a type of linear model that accounts for repeated response measures on the same subject. They extend the general linear model by allowing and accounting for non-independence among the observations of a single subject.

They do this by estimating one or more parameters that capture the covariance among the residuals.

So rather than assuming a single constant variance and zero covariance for all residuals, observations from the same subject are allowed to have different variances and nonzero covariances. The pattern of variances and covariances is known as the covariance structure of the R matrix.

They still assume that observations from different subjects are independent, and linear marginal models still assume residuals are normally distributed.
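
To show what a covariance structure looks like, here is a small sketch that builds the within-subject block of the R matrix under three commonly used structures: independence, compound symmetry, and first-order autoregressive, AR(1). The function names and the values of `sigma2` and `rho` are my own choices for illustration; your software will offer these and other structures under its own option names.

```python
def independence(n, sigma2):
    """Constant variance, zero covariance: the general linear model's assumption."""
    return [[sigma2 if i == j else 0.0 for j in range(n)] for i in range(n)]


def compound_symmetry(n, sigma2, rho):
    """Constant variance, and the same covariance between every pair of occasions."""
    return [[sigma2 if i == j else rho * sigma2 for j in range(n)] for i in range(n)]


def ar1(n, sigma2, rho):
    """Constant variance, with covariance shrinking as occasions get further apart."""
    return [[sigma2 * rho ** abs(i - j) for j in range(n)] for i in range(n)]


def show(name, matrix):
    print(name)
    for row in matrix:
        print("  " + "  ".join(f"{value:5.2f}" for value in row))


# One subject measured on 4 occasions; sigma2 and rho are illustrative values.
show("independence", independence(4, sigma2=2.0))
show("compound symmetry", compound_symmetry(4, sigma2=2.0, rho=0.5))
show("AR(1)", ar1(4, sigma2=2.0, rho=0.5))
```

A marginal model estimates the handful of parameters that define the chosen structure (here `sigma2` and `rho`) rather than fixing the off-diagonal entries at zero.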

**GEE Models**

Generalized estimating equation models are generalized linear marginal models. That is, they combine the generalized linear model for non-normal residuals with the repeated measures of a marginal model. You would use these when you have repeated measures on each subject and need to run a logistic, multinomial, Poisson or other generalized linear regression model.

**Linear Mixed Models**

Like marginal models, linear mixed models account for non-independence among clustered observations, but they do it in a different way.

Instead of estimating nonzero correlations among residuals, linear mixed models account for the fact that clustered observations are similar by estimating the variance among cluster means. They literally partition the variance in Y into cluster-level and observation-level parts.
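
Here is a toy sketch of that partitioning. Note the swap: mixed model software normally estimates these variances by maximum likelihood or REML, whereas this example uses the simpler one-way ANOVA (method-of-moments) estimator on balanced data, just to make the two parts visible. The data and the function name `variance_components` are invented for illustration.

```python
def variance_components(clusters):
    """Method-of-moments split of variance into cluster-level and residual parts.

    Assumes every cluster has the same number of observations (balanced data).
    Returns (cluster_variance, residual_variance).
    """
    k = len(clusters)
    n = len(clusters[0])
    if any(len(c) != n for c in clusters):
        raise ValueError("this sketch assumes equal cluster sizes")

    means = [sum(c) / n for c in clusters]
    grand_mean = sum(means) / k

    # Mean squares between and within clusters, as in one-way ANOVA.
    ms_between = n * sum((m - grand_mean) ** 2 for m in means) / (k - 1)
    ms_within = sum(
        (y - m) ** 2 for c, m in zip(clusters, means) for y in c
    ) / (k * (n - 1))

    # E[MS_between] = residual variance + n * cluster variance; truncate at zero.
    cluster_variance = max((ms_between - ms_within) / n, 0.0)
    return cluster_variance, ms_within


# Three clusters of three observations each (made-up numbers).
data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
cluster_var, resid_var = variance_components(data)

# Intraclass correlation: the share of variance that sits at the cluster level.
icc = cluster_var / (cluster_var + resid_var)
print(f"cluster-level variance: {cluster_var:.3f}")
print(f"observation-level variance: {resid_var:.3f}")
print(f"intraclass correlation: {icc:.3f}")
```

In this toy data set most of the variation is between clusters, so the intraclass correlation is high and treating the nine observations as independent would be a mistake.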

Because of the way they account for variation among subjects, linear mixed models are much more flexible than marginal models.

For example, they can accommodate three levels of repeated measures or clustering, like repeated measurements on patients clustered within hospitals, and can be used to estimate subject-specific effects beyond variation among means, such as slopes that vary across subjects.

They can accomplish these feats because they include parameters to measure the random effects of the clusters, treating the variation among clusters as another sort of residual variation. The “mixed” in the name comes from the fact that they estimate both fixed and random effects.

Like all linear models, linear mixed models assume residuals are normally distributed and the relationship between Y and the model parameters is linear.
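
As a sketch of the simplest case, a random-intercept model for observation i in cluster j can be written as follows. The indices i and j, the cluster effect u_j, and its variance τ² are notation I am adding here; they extend the general linear model equation above with one extra, cluster-level residual.

```latex
Y_{ij} = \beta_0 + \beta_1 X_{1ij} + \beta_2 X_{2ij} + u_j + \varepsilon_{ij},
\qquad u_j \sim N(0, \tau^2),
\qquad \varepsilon_{ij} \sim \text{iid } N(0, \sigma^2)
```

The β terms are the fixed effects, u_j is the random effect of cluster j, and τ² and σ² are the cluster-level and observation-level variances.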

**Generalized Linear Mixed Models**

You probably know by now where this one is going.

Generalized Linear Mixed Models are mixed models in which the residuals follow a distribution from the same exponential family. They require the same link functions as generalized linear models *and* at least one random effect.

Both generalized linear models and linear mixed models can be computationally intensive, the latter especially as the number of random effects to be estimated goes beyond one or two. Putting them together can be even more so. I’ve run GLMMs that took hours to run on not very large data sets. They require special care and should not be undertaken lightly.

**How all the models are the same**

I’ve focused on how these models differ, but they also have underlying similarities.

- The structure is the same: they all are models of the relationship between a single response variable Y, and one or more predictor variables X. The variation around the model is estimated in the residual.
- They generally all use some form of maximum likelihood estimation. Even OLS estimation, used in the general linear model, is a special case of maximum likelihood.
- Fixed effects work the same in all these models. The function of Y may differ, and the residual structure may differ, but the X variables work the same in every one of these models. Dummy and effect coding, continuous predictors, interactions, quadratic terms have the same inherent meaning and can be used in any of these models.
- The General Linear Model is a special case of each of these other models. You could, if you really wanted to, run a general linear model in a software procedure designed for any of these other models by choosing the right options. The reverse is not true.
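
To see why OLS counts as a special case of maximum likelihood, consider this sketch of the log-likelihood for the general linear model with iid normal residuals. The observation index i on the predictors and the sample size n are notation I am adding.

```latex
\ell(\beta, \sigma^2)
  = -\frac{n}{2}\log(2\pi\sigma^2)
    - \frac{1}{2\sigma^2}\sum_{i=1}^{n}
      \left(Y_i - \beta_0 - \beta_1 X_{1i} - \beta_2 X_{2i}\right)^2
```

For any fixed σ², the first term does not involve the β's, so maximizing the log-likelihood over the β's is the same as minimizing the sum of squared residuals, which is exactly what OLS does.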


Hello Karen,

First of all thanks so much for sharing this kind of great information regarding “Extensions of the General Linear Model”.

But I have one question, How to check distributions while using Generalized Linear Mixed Models?

On which basis I get to know that the models showed normal or lognormal or binomial or Poisson or negative binomial or gamma?

Could you please help me out with this issue?

Thanks in advance.

Best,

Vijay

Hi Vijay,

It mostly comes from knowing what kind of response variable you have. You may find these helpful:

https://www.theanalysisfactor.com/when-dependent-variables-are-not-fit-for-glm-now-what/

https://www.theanalysisfactor.com/dependent-variables-never-meet-normality/

Hi Karen,

Firstly, a big thank you!! I found “General Linear Model and its extensions” mentioned somewhere and finally I get what it means, it’s exactly what you have explained.

However, I’ve been wondering if there is really a standard definition of general linear model. The resource that I was following (Univ of Texas Arlington – http://www.uta.edu/faculty/sawasthi/Statistics/stglm.html#reg_extension) and even Wikipedia describes general l.m. to have multiple dependent variables. UTA also says that dep. variables can be correlated to each other, independents can be correlated to each other. So that way multivariate regression and repeated measure regression are subsets of general l.m. and multi-co-linearity is already resolved by generalized inverse. That way there’s hardly any extensions possible over general linear model.

If you have any links to global standards for general linear model definition kindly share.

Thanks!

Dipanjan

Hi Dipanjan,

I’m not sure I have any links, but yes, the general linear model can have multiple DVs.

Technically, anova and regression are subsets in which the Y matrix is a vector–one DV.

I wonder if the models described are extensions of the linear regression model rather than extensions of the general linear model because none of the extensions involves multivariate data (i.e., multiple dependent variables).

I have case-control data with non-parametric noise. I have been able to find signal using GEE denoting each subject as a cluster (as is standard because it is built on marginal models). Conditional logistic regression appears standard for case-control studies but I couldn’t find a clogit approach that used population averaging.

Could it be okay to cluster on case-control groupings in a GEE model even though it is built on a marginal model? This would increase within-cluster correlation and decrease between-cluster correlation. (assumptions of “cluster data” as described in the R::geeglm packages paper)

Dear Karen,

Thanks a lot for this very helpful website!

At the moment, I am looking for an adequate model for some repeated measure data and think that GEE or GENLINMIXED would be the best fitting.

My dependent variable is a 0,1 dummy variable called “attribute” (y/n) and there are some independent variables such as gender, profession (nominal variable with 4 possible values), age (at the moment of the interview), and another dummy variable focussing on prizes (y/n that are awarded every year).

The repeated measurements took place various times a year (or, award period if you like) but are unfortunately not identically distributed within the years.

Now I thought of an SPSS syntax like this:

GENLIN attitude (REFERENCE=FIRST) BY profession gender award award_period WITH age_in_days

/MODEL profession gender age_in_days award INTERCEPT=YES DISTRIBUTION=BINOMIAL LINK=LOGIT

/REPEATED SUBJECT=respondent_ID WITHINSUBJECT=interview_number.

Interview number is uniquely identifying all interviews (repeated measurements), respondent_ID uniquely identifies all interviewees. The data are in the long format, so each respondent x measurement combination is one case.

But I am not sure whether this is really the best way to start with. By the way, we assume that there might be some profession and year specific “clustering” effects.

I am looking forward to any helpful comment!

All the best,

Pauline

Hi Pauline,

As a general rule, yes you’ll need either GEE or GLMM for repeated measures study on a binary outcome. But these are very complicated models and I wouldn’t feel comfortable giving advice on a specific analysis without knowing ALL the details (it’s all about the details!).

Priceless ! Thanx for these clear explanations. This blog rules !

Amen!!
