Here’s a little reminder for those of you checking assumptions in regression and ANOVA:

The assumptions of normality and homogeneity of variance for linear models are **not** about Y, the dependent variable. (If you think I’m either stupid, crazy, or just plain nit-picking, read on. This distinction really is important.)

The distributional assumptions for linear regression and ANOVA are for the distribution of Y|X (that’s Y given X). You have to take **out** the effects of all the Xs **before** you look at the distribution of Y. As it turns out, the distribution of Y|X is, by definition, the same as the distribution of the residuals. (Strictly, Y|X is centered on its predicted value and the residuals are centered on zero, so it is the shape and spread that match.) So the easiest way to check the distribution of Y|X is to save your residuals and check their distribution.

I’ve seen too many researchers drive themselves crazy trying to transform skewed Y distributions before they’ve even run the model. The distribution of the dependent variable **can** tell you what the distribution of the residuals **is not**: you just can’t get normal residuals from a binary dependent variable.

But it cannot always tell you what the distribution of the residuals **is**.

If a categorical independent variable had a big effect, the dependent variable would have a continuous, bimodal distribution. But the residuals (or the distribution within each category of the independent variable) would be normally distributed.
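A small simulation can make this concrete. The sketch below is not from any real analysis: the group labels, the effect size (10 units), and the sample sizes are made-up assumptions. Excess kurtosis is used only as a rough, text-friendly indicator of shape (near 0 for a normal distribution, strongly negative for two well-separated modes).

```python
import random
import statistics


def excess_kurtosis(values):
    """Population excess kurtosis: about 0 for normal data, negative for flat or bimodal data."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0


rng = random.Random(0)

# Assumed setup: one categorical X with two levels and a large effect.
groups = ["a"] * 2000 + ["b"] * 2000
effect = {"a": 0.0, "b": 10.0}
y = [effect[g] + rng.gauss(0.0, 1.0) for g in groups]

# "Taking out" the effect of X: subtract each observation's group mean.
group_mean = {
    g: statistics.fmean(v for v, gg in zip(y, groups) if gg == g) for g in effect
}
residuals = [v - group_mean[g] for v, g in zip(y, groups)]

kurt_y = excess_kurtosis(y)                  # raw Y: two separated humps
kurt_resid = excess_kurtosis(residuals)      # residuals: one roughly normal hump

print(f"excess kurtosis of Y:         {kurt_y:.2f}")
print(f"excess kurtosis of residuals: {kurt_resid:.2f}")
```

Under these assumed settings, raw Y should look clearly non-normal while the residuals look close to normal, which is the situation described above.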

And what are those distributional assumptions of Y|X?

1. Independence

2. Normality

3. Constant Variance

You can check all three with a few residual plots: a Q-Q plot of the residuals for normality, and a scatter plot of the residuals against X or against the predicted values of Y to check independence and constant variance (1 and 3).
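For readers who want to see what those plots are built from, here is a minimal text-only sketch using just the Python standard library. The data are simulated under assumed values (intercept 2, slope 3, normal errors with SD 1), and the two summary numbers are stand-ins for looking at the plots, not formal tests. It covers the normality and constant-variance checks only.

```python
import random
from statistics import NormalDist, fmean, pstdev


def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    mu, mv = fmean(u), fmean(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = (sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v)) ** 0.5
    return num / den


# Hypothetical data: y = 2 + 3x + normal noise with constant variance.
rng = random.Random(1)
x = [rng.uniform(0.0, 10.0) for _ in range(1000)]
y = [2.0 + 3.0 * a + rng.gauss(0.0, 1.0) for a in x]

intercept, slope = fit_line(x, y)
fitted = [intercept + slope * a for a in x]
residuals = [b - f for b, f in zip(y, fitted)]

# Q-Q plot coordinates: sorted residuals against standard normal quantiles.
# Points close to a straight line (correlation near 1) suggest normality.
n = len(residuals)
theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
observed = sorted(residuals)
qq_corr = pearson(theoretical, observed)

# Residuals vs. predicted values: compare the residual spread in the lower
# and upper halves of the fitted values. A ratio near 1 suggests constant variance.
order = sorted(range(n), key=lambda i: fitted[i])
low = [residuals[i] for i in order[: n // 2]]
high = [residuals[i] for i in order[n // 2:]]
spread_ratio = pstdev(high) / pstdev(low)

print(f"Q-Q correlation: {qq_corr:.4f}")
print(f"spread ratio (high/low fitted): {spread_ratio:.2f}")
```

In practice you would plot `theoretical` against `observed`, and `fitted` against `residuals`, in whatever software you use. The numbers here only summarize what the eye would pick up.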

________________________________________________________

This explanation is super helpful, thank you!

I have three treatments and 2 timepoints. I have performed a mixed model and saved the residuals. When testing for normality (using the Explore command), should I include treatment in the factor list in order to make Q-Q plots for each group? Or analyze it all as one?

and what about time?

Hello Karen,

When you say account for all Xs, do we also include the control variables (in addition to the predictors)?

Thank you!

Yes. ALL Xs, both control and predictors.

Hi Karen,

Thank you very much, just to know that I’ve got the concepts right is a relief actually!!! I’ll keep looking =)

Hi Karen,

I’m struggling, trying to find any information about how to test the assumptions for a type II regression (MA). If I understood well, you check for normality by analyzing the residuals of Y, because you assume that X has no random error, which is appropriate for simple linear regression (OLS). However, when performing a type II regression, we assume that X also has an associated error. So how can I test the assumptions (and are they the same) in those MA regressions?

thank you very much

PS: sorry about my English, I’m Brazilian =)

Hi Bianca,

This is a great question, but I don’t have the answer. Hopefully another reader can comment. I know Type II regression well enough to say you’ve got the concepts right and I agree it makes sense that Xs also have associated error, but I can’t verify it.

Hi Karen,

So this means checking the assumptions using the residuals generated from the model instead of the dependent variable itself? If I am running a Linear Mixed Model in SPSS, is there any way to check homogeneity of variance (it is not set as a default as in Univariate)? And should Levene’s test be used on the residuals to check for homogeneity of variance?

Hi Oriole,

Yes, exactly. Save the residuals and do your assumption checks on them, not Y.

A Linear Mixed Model in SPSS can save the residuals, and then you do everything the same as you would in any linear model for checking assumptions. As a general rule, I don’t use Levene’s test for homogeneity of variance, as it is unreliable.

Hi anyone,

I am not sure why the assumptions of ANOVA and linear regression are the same. Can anyone explain it to me in detail?

Normality, equal variances and independence

Hi Karen – thanks for the article. This can be a confusing topic. Say I have a categorical variable with three levels (e.g. country) and I am using it to predict income. After using a General Linear Model to get residuals I check to see if they were normally distributed using a Shapiro-Wilk test. As a whole the residuals were normally distributed but when I break the residuals down into the levels of each category (the residuals of the predictions for each country) then only two of the three countries have normally distributed residuals. Does this mean that the assumption of normally distributed residuals has been broken? Or is it okay since the overall residuals of the model are normal?

Hi Karen,

1. Do all statistical packages (eg. SPSS) also assume this residual consideration for their normality check tests? I mean, when we enter dv raw scores in the Explore menu for normality test, does SPSS’s algorithm intelligently use and compute residuals to return the normality test?

2. Suppose we have a 2 x 2 factorial ANOVA as an example. How can one check the normality assumption for the residuals? Should we take the residuals from each cell (to comply with Y|X) or the overall residuals regardless of the factor? SPSS allows us to do both (via the Factor field in the Explore menu).

I have a hunch that we have to generate/calculate residuals manually before doing the normality test, but still unsure about it.

“You have to take out the effects of all the Xs before you look at the distribution of Y. As it turns out, the distribution of Y|X is, by definition, the same as the distribution of the residuals.”

This is a bit confusing. How do you take “out” the effects of all the Xs in the context above? I just wanted to know the mechanism; an example with some data points would definitely help. Also, what leads us to believe that the distribution of Y is the same as the distribution of the residuals?

Sorry, I mean to say, what leads us to believe that the distribution of Y|X is same as the distribution of the residuals?

Because all X’s are assumed fixed. In other words, they are assumed to have no random error. So when you add the Xs to the residuals, you’re just adding constants (at least theoretically).

It’s very hard to think about without writing out the equations, but that’s the gist of it. The easiest example would be one in which there was only one X, which had only two values.

I may have to write out a separate blog post, with pictures, to show you.
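In the meantime, here is a sketch of the equations behind that reply. The notation is an assumption introduced for illustration (it does not appear in the post), and it covers only the simplest case: one fixed X and independent normal errors.

$$
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2) \ \text{independently}
$$

Because X is treated as fixed, $\beta_0 + \beta_1 x$ is just a constant once $X = x$, so

$$
Y \mid X = x \;\sim\; N(\beta_0 + \beta_1 x,\ \sigma^2), \qquad Y - (\beta_0 + \beta_1 x) = \varepsilon \;\sim\; N(0, \sigma^2).
$$

If X takes only the values 0 and 1, this gives two normal distributions with the same variance and different means:

$$
Y \mid X = 0 \;\sim\; N(\beta_0,\ \sigma^2), \qquad Y \mid X = 1 \;\sim\; N(\beta_0 + \beta_1,\ \sigma^2).
$$

The residuals $e_i = Y_i - \hat{Y}_i$ are the sample estimates of the $\varepsilon_i$, which is why their distribution is used to check the assumptions about Y|X.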

Given any dependent variable, how would you choose which transformation (if any) of this variable to regress on, with regard to the normality assumption? It sounds from this brief explanation like there is no way to do that.

Hi Joram, there is. Sometimes you can do it with logic. E.g., for a right-skewed distribution I need a function that will affect high numbers more than low numbers. Logs and square roots both do that. Another option is to use the Box-Cox transformation, which will give you an idea of the most effective power transformations.
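As a rough illustration of the Box-Cox idea, the sketch below searches a grid of powers (lambda) and keeps the one with the highest profile log-likelihood. It is a simplified, intercept-only version written with the standard library. The data are simulated (log-normal by assumption), the grid from -2 to 2 is an arbitrary choice, and with predictors in the model the variance term would come from the model residuals rather than from Y itself.

```python
import math
import random


def boxcox(y, lam):
    """Box-Cox transform of positive values; lam == 0 is the log transform."""
    if lam == 0.0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]


def profile_loglik(y, lam):
    """Profile log-likelihood of lam for an intercept-only normal model."""
    n = len(y)
    z = boxcox(y, lam)
    mean_z = sum(z) / n
    var_z = sum((v - mean_z) ** 2 for v in z) / n
    # Variance term plus the Jacobian of the transformation.
    return -0.5 * n * math.log(var_z) + (lam - 1.0) * sum(math.log(v) for v in y)


# Hypothetical right-skewed data: log-normal, so the log (lam = 0) should fit well.
rng = random.Random(2)
y = [math.exp(rng.gauss(1.0, 0.8)) for _ in range(1000)]

lambdas = [k / 10 for k in range(-20, 21)]   # -2.0, -1.9, ..., 2.0
best_lambda = max(lambdas, key=lambda lam: profile_loglik(y, lam))

print(f"best lambda on the grid: {best_lambda}")
```

A lambda near 0 points to a log transformation, near 0.5 to a square root, and near 1 to leaving the variable alone.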

Thanks for the notes up there. You are right, many researchers (including me) drive ourselves crazy trying to test normality and the other assumptions on the DV. But that’s what we have been taught by either our stats teachers or stats books. Thanks for the enlightenment 🙂

Hi Nur,

You’re welcome. I used to teach stats, and sometimes there are just too many new concepts you’re throwing at students to really clarify the difference. So I’m sure at the time, that was the best way to teach it. But now you’re sophisticated enough to stop driving yourself crazy. 🙂

Karen

What do you do if a review of the residuals reveals non-normal characteristics? For example, I’m working on a General Linear Model analysis (dependent variables are nominal and random) and the distribution of the residuals looks sigmoidal (like an S). The residual vs. observation order plot shows a few spikes.

Hi Sonia,

The dependent variable is nominal? You need a logistic regression then instead of a GLM. The sigmoidal residuals are exactly what happens. Here is another article that might clarify:

When Dependent Variables Are Not Fit for GLM, Now What?

Is there actually an order to the observations? Unless you have time-series or spatial data, there usually isn’t. Those are the situations where autocorrelation comes in.

I have a question about the assumptions of linear regression and ANOVA: what are the differences in the assumptions behind these models?

Hi Joe, there are no differences in the assumptions. ANOVA and Regression are really just two forms of the same theoretical model.

Now since the assumptions are about Y given X (Y|X), and the X’s usually have a different form in the two models, they do manifest slightly differently. For example, if you look at two very simple models (a one-way ANOVA and a simple regression with a single continuous predictor), the X is categorical in the former and continuous in the latter.

That means that in the ANOVA, the assumptions about Y|X being independent with a normal distribution and constant variance apply to the values of Y within each level of X.

In the regression, since X is continuous, it’s hard to look at the distribution of Y at EACH value of X (it’s impossible, actually, theoretically). So although the assumption is the same, it’s easier to check it by looking at the residuals, which have the same distribution as Y|X.
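For the one-way ANOVA case, the link between "Y within each level of X" and the residuals can be checked directly. In the sketch below, the three level names, their means, and the error SD are all made-up assumptions. It shows that subtracting each level’s mean leaves the within-level spread untouched, while the overall spread of raw Y is inflated by the differences between levels.

```python
import random
from statistics import fmean, pstdev

rng = random.Random(3)

# Hypothetical one-way ANOVA data: three levels, same error SD in each.
level_means = {"low": 5.0, "mid": 8.0, "high": 20.0}
data = [
    (lvl, mu + rng.gauss(0.0, 2.0))
    for lvl, mu in level_means.items()
    for _ in range(200)
]

# Residual = observed Y minus the mean of its own level.
cell_mean = {lvl: fmean(v for l, v in data if l == lvl) for lvl in level_means}
residuals = [(lvl, v - cell_mean[lvl]) for lvl, v in data]

# Within each level, Y and the residuals differ only by a constant shift,
# so their standard deviations are identical.
sd_y_within = {lvl: pstdev([v for l, v in data if l == lvl]) for lvl in level_means}
sd_resid_within = {
    lvl: pstdev([r for l, r in residuals if l == lvl]) for lvl in level_means
}

# Pooled over levels, raw Y also carries the between-level differences.
sd_y_overall = pstdev([v for _, v in data])
sd_resid_overall = pstdev([r for _, r in residuals])

for lvl in level_means:
    print(lvl, round(sd_y_within[lvl], 3), round(sd_resid_within[lvl], 3))
print("overall:", round(sd_y_overall, 3), round(sd_resid_overall, 3))
```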

Sorry if this is a stupid question: does analyzing the distribution of Y|X in the context of a linear model mean that we need to check the _residuals_ of Y at each level of our factor?
