I am reviewing your notes from your workshop on assumptions. You have made it very clear how to analyze normality for regressions, but I could not find how to determine normality for ANOVAs. Do I check for normality for each independent variable separately? Where do I get the residuals? What plots do I run? Thank you!
I received this great question this morning from a past participant in my Assumptions of Linear Models workshop.
It’s one of those quick questions without a quick answer. Or rather, without a quick and useful answer. The quick answer is:
Do it exactly the same way. All of it.
The longer, useful answer is this: The assumptions are exactly the same for ANOVA and regression models. The normality assumption is that residuals follow a normal distribution. You usually see it like this:
ε ~ i.i.d. N(0, σ²)
But what it’s really getting at is the distribution of Y|X. That’s Y given the value of X. Because X values are considered fixed, they have no distributions. Residuals have the same distribution as Y|X, just shifted so it is centered at zero. If residuals are normally distributed, it means that Y is normally distributed within each value of X (not necessarily overall).
The only difference between the models is that ANOVAs generally have only categorical predictor variables, whereas regressions tend to have mostly continuous ones. So while the assumption is the same, it plays out differently.
When predictors are continuous, it’s impossible to check for normality of Y separately for each individual value of X. There are too many values of X and there is usually only one observation at each value of X. So you have to use the residuals to check normality.
But when predictors are categorical, there are usually just a few values of X (the categories), and there are many observations at each value of X. So you’ll often see the normality assumption for an ANOVA stated as:
“The distribution of Y within each group is normally distributed.” It’s the same thing as Y|X and in this context, it’s the same as saying the residuals are normally distributed.
The concept of a residual seems strange in an ANOVA, and often in that context, you’ll hear them called “errors” instead of “residuals.” But they’re the same thing. Each one is the distance between the actual value of Y and the mean value of Y for a specific value of X. Those distances have the same distribution as the Ys within that group, just centered at zero.
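To make that concrete, here is a minimal Python sketch for a one-way layout, where each residual is just the observation minus its group mean. The data and the names groups and group_residuals are made up for illustration and use only the standard library; your statistical software will compute the same thing from the fitted model.

```python
from statistics import mean

# Hypothetical data: response values Y for three groups (made up for illustration).
groups = {
    "A": [4.1, 5.0, 5.9, 4.6, 5.4],
    "B": [6.8, 7.5, 8.1, 7.0, 7.6],
    "C": [5.2, 6.1, 5.7, 6.4, 5.6],
}

def group_residuals(data):
    """Return {group: [y - group mean, ...]} for a one-way layout."""
    residuals = {}
    for name, ys in data.items():
        group_mean = mean(ys)
        # Each residual is the distance from the observation to its group mean.
        residuals[name] = [y - group_mean for y in ys]
    return residuals

residuals = group_residuals(groups)
for name, res in residuals.items():
    print(name, [round(r, 2) for r in res])
```

Within each group the residuals keep the spread and shape of the original Ys, but they sum to zero because the group mean has been subtracted out.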
So in ANOVA, you actually have two options for testing normality. If there really are many values of Y for each value of X (each group), and there really are only a few groups (say, four or fewer), go ahead and check normality separately for each group.
But if you have many groups (a 2x2x3 ANOVA has 12 groups) or if there are few observations per group (it’s hard to check normality on only 20 data points), it’s often easier to just use the residuals and check them all together.
If you have a continuous covariate in the model as well, you’ve just lost option one, and residuals are the only way to go.
GLM procedures in statistical software generally have an option to save residuals. Once you do, run the same QQ plots to check normality as you would in regression.
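If you want to see what a normal QQ plot is doing under the hood, here is a rough sketch in standard-library Python covering both options: QQ points per group, and QQ points for the pooled residuals. The data, the helper qq_points, and the plotting positions (i - 0.5) / n are assumptions chosen for illustration; software packages use slightly different conventions.

```python
from statistics import NormalDist, mean

# Hypothetical data: Y values for three groups (made up for illustration).
groups = {
    "A": [4.1, 5.0, 5.9, 4.6, 5.4],
    "B": [6.8, 7.5, 8.1, 7.0, 7.6],
    "C": [5.2, 6.1, 5.7, 6.4, 5.6],
}

def qq_points(values):
    """Return (theoretical quantile, sample quantile) pairs for a normal QQ plot."""
    n = len(values)
    ordered = sorted(values)
    std_normal = NormalDist()  # standard normal: mean 0, sd 1
    # Plotting positions (i - 0.5) / n are one common convention, assumed here.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Option 1: check Y separately within each group.
per_group = {name: qq_points(ys) for name, ys in groups.items()}

# Option 2: subtract each group's mean and check all residuals together.
pooled = [y - mean(ys) for ys in groups.values() for y in ys]
pooled_points = qq_points(pooled)

for theo, samp in pooled_points:
    print(f"{theo:7.3f}  {samp:7.3f}")
```

Plot the second column against the first; points falling close to a straight line are consistent with normality. With only five observations per group, as in this toy data, the per-group plots tell you very little, which is exactly when pooling the residuals is the better choice.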