Checking Assumptions in ANOVA and Linear Regression Models: The Distribution of Dependent Variables

Here’s a little reminder for those of you checking assumptions in regression and ANOVA:

The assumptions of normality and homogeneity of variance for linear models are not about Y, the dependent variable. (If you think I’m either stupid, crazy, or just plain nit-picking, read on. This distinction really is important.)

The distributional assumptions for linear regression and ANOVA are about the distribution of Y|X, that is, Y given X. You have to take out the effects of all the Xs before you look at the distribution of Y. As it turns out, the distribution of Y|X is, by definition, the same as the distribution of the residuals. (More precisely, Y|X is the error term shifted by the predicted mean at each X, and the residuals are the model’s estimates of those errors.) So the easiest way to check the distribution of Y|X is to save your residuals and check their distribution.
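To make "save your residuals and check their distribution" concrete, here is a minimal sketch using only the Python standard library. The data are simulated and the helper names (fit_line, skewness) are made up for this illustration; in practice your statistics package saves the residuals for you. It shows a Y that is clearly right-skewed on its own while the residuals from the fitted line are roughly symmetric.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Simulated data (an assumption for illustration): a right-skewed predictor X
# and normally distributed errors around a straight line.
n = 2000
x = [random.expovariate(1.0) for _ in range(n)]
y = [2.0 + 3.0 * xi + random.gauss(0.0, 1.0) for xi in x]


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sxy / sxx
    return my - slope * mx, slope


def skewness(values):
    """Sample skewness: third central moment divided by variance ** 1.5."""
    m = statistics.fmean(values)
    m2 = statistics.fmean([(v - m) ** 2 for v in values])
    m3 = statistics.fmean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5


intercept, slope = fit_line(x, y)

# "Taking out the effect of X": subtract the predicted value from each Y.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(f"skewness of Y:         {skewness(y):.2f}")
print(f"skewness of residuals: {skewness(residuals):.2f}")
```

Y inherits its skew from X here, so transforming Y before fitting would be solving a problem the model does not have.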

I’ve seen too many researchers drive themselves crazy trying to transform skewed Y distributions before they’ve even run the model. The distribution of the dependent variable can tell you what the distribution of the residuals is not: for example, you just can’t get normal residuals from a binary dependent variable.

But it cannot always tell you what the distribution of the residuals is.

For example, if a two-category independent variable had a big effect, the dependent variable would have a continuous, bimodal distribution. But the residuals (equivalently, the values of Y within each category of the independent variable) could still be normally distributed.
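The bimodal case can be simulated in a few lines. This is only a sketch under assumed numbers: two hypothetical groups ("control" and "treated") with means 0 and 10 and a standard deviation of 1. The helper share_near_center is invented for the illustration; it reports the fraction of values within one unit of the overall mean, which is close to zero for a bimodal variable and about two thirds for a normal one with standard deviation 1.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical two-group design with a large group effect.
group_means = {"control": 0.0, "treated": 10.0}
data = [(g, random.gauss(mu, 1.0))
        for g, mu in group_means.items()
        for _ in range(1000)]

y = [value for _, value in data]

# In a one-way ANOVA the residual is the deviation from the group's own mean.
sample_means = {
    g: statistics.fmean([v for gg, v in data if gg == g])
    for g in group_means
}
residuals = [value - sample_means[g] for g, value in data]


def share_near_center(values, half_width=1.0):
    """Fraction of values within half_width of the overall mean."""
    center = statistics.fmean(values)
    return sum(abs(v - center) <= half_width for v in values) / len(values)


share_y = share_near_center(y)
share_resid = share_near_center(residuals)

print(f"share of Y near its mean:         {share_y:.3f}")
print(f"share of residuals near their mean: {share_resid:.3f}")
```

The empty middle of Y is produced entirely by the group effect. Once the group means are removed, the middle fills in as a normal distribution would.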

And what are those distributional assumptions of Y|X?

1. Independence

2. Normality

3. Constant Variance

You can check all three with a few residual plots: a Q-Q plot of the residuals for normality, and a scatter plot of the residuals against X or against the predicted values of Y for independence and constant variance.
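For readers who want to see what goes into those plots, the sketch below computes the Q-Q plot coordinates and a crude numeric stand-in for the residuals-versus-predicted plot, using only the Python standard library. The data are simulated, and the two summaries (qq_correlation and spread_ratio) are illustrative names, not standard tests. In real work you would draw the plots with your usual software and look at them.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is repeatable

# Simulated data that satisfy the assumptions (an illustration only).
n = 1000
x = [random.uniform(0.0, 10.0) for _ in range(n)]
y = [1.0 + 2.0 * xi + random.gauss(0.0, 1.0) for xi in x]

# Least-squares fit with one predictor, then predicted values and residuals.
mx, my = statistics.fmean(x), statistics.fmean(y)
slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
         / sum((a - mx) ** 2 for a in x))
intercept = my - slope * mx
predicted = [intercept + slope * xi for xi in x]
residuals = [yi - p for yi, p in zip(y, predicted)]


def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = (sum((u - ma) ** 2 for u in a)
           * sum((v - mb) ** 2 for v in b)) ** 0.5
    return num / den


# Q-Q plot coordinates: sorted residuals against standard normal quantiles.
std_normal = statistics.NormalDist()
theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
observed = sorted(residuals)
qq_correlation = pearson(theoretical, observed)  # near 1 if points fall on a line

# Residuals vs. predicted: compare the residual spread in the lower and upper
# halves of the predicted values. A ratio far from 1 suggests a fan shape.
pairs = sorted(zip(predicted, residuals))
low = [r for _, r in pairs[: n // 2]]
high = [r for _, r in pairs[n // 2:]]
spread_ratio = statistics.stdev(high) / statistics.stdev(low)

print(f"Q-Q correlation: {qq_correlation:.4f}")
print(f"spread ratio (upper half / lower half): {spread_ratio:.2f}")
```

The pairs (theoretical[i], observed[i]) are exactly the points of a Q-Q plot, and (predicted[i], residuals[i]) are the points of the residual scatter plot, so either list can be handed to any plotting tool.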


Comments

Loretta Rafay says

April 3, 2018 at 7:25 pm

This explanation is super helpful, thank you!

Andre says

March 28, 2016 at 9:01 pm

I have three treatments and 2 timepoints. I have performed a Mixed model and saved the residuals. When testing for normality (using the Explore command), should I include treatment in the factor list in order to make Q-Q plots for each group? Or analyze it all as one?

And what about time?

Mano says

December 25, 2014 at 4:20 pm

Hello Karen,

When you say account for all Xs, do we also include the control variables (in addition to the predictors)?

Thank you!

- Karen says
  
  December 29, 2014 at 2:04 pm
  
  Yes. ALL Xs, both control and predictors.
  
Bianca says

November 24, 2014 at 10:01 am

Hi Karen,

Thank you very much, just to know that I’ve got the concepts right is a relief actually!!! I’ll keep looking =)

Bianca says

November 22, 2014 at 4:57 pm

Hi Karen,

I’m struggling, trying to find any information about how to test the assumptions for a type II regression (MA). If I understood well, you check for normality by analyzing the residuals of Y, because you assume that X has no random error, which is appropriate for simple linear regression (OLS). However, when performing a type II regression, we assume that X also has an associated error. So how can I test the assumptions (and are they the same) in those MA regressions?

Thank you very much.

P.S. Sorry about my English, I’m Brazilian =)

- Karen says
  
  November 23, 2014 at 1:03 pm
  
  Hi Bianca,
  
  This is a great question, but I don’t have the answer. Hopefully another reader can comment. I know Type II regression well enough to say you’ve got the concepts right and I agree it makes sense that Xs also have associated error, but I can’t verify it.
  
Oriole says

October 18, 2014 at 11:08 pm

Hi Karen,
So what it means is to check the assumptions by using the residuals generated from the model instead of the dependent variable itself? If I am running a Linear Mixed Model in SPSS, is there any way to check homogeneity of variance (it is not set as a default as in univariate)? And should Levene’s test be used on the residuals to check for homogeneity of variance?

- Karen says
  
  October 20, 2014 at 9:25 am
  
  Hi Oriole,
  
  Yes, exactly. Save the residuals and do your assumption checks on them, not Y.
  
  A Linear Mixed Model in SPSS can save the residuals, and then you do everything the same as you would in any linear model for checking assumptions. As a general rule, I don’t use Levene’s test for homogeneity of variance, as it is unreliable.
  
heather says

September 26, 2014 at 12:07 am

Hi anyone,

I am not sure why the assumptions of ANOVA and linear regression (normality, equal variances, and independence) are the same. Can anyone explain this to me in detail?

Arran Davis says

April 10, 2014 at 3:54 pm

Hi Karen – thanks for the article. This can be a confusing topic. Say I have a categorical variable with three levels (e.g. country) and I am using it to predict income. After using a General Linear Model to get residuals, I checked whether they were normally distributed using a Shapiro-Wilk test. As a whole the residuals were normally distributed, but when I broke the residuals down by level of the category (the residuals of the predictions for each country), only two of the three countries had normally distributed residuals. Does this mean that the assumption of normally distributed residuals has been broken? Or is it okay since the overall residuals of the model are normal?

hamzah says

February 23, 2014 at 8:16 am

Hi Karen,
1. Do all statistical packages (e.g. SPSS) also take the residuals into account in their normality tests? I mean, when we enter raw DV scores in the Explore menu for a normality test, does SPSS’s algorithm intelligently compute and use the residuals to return the normality test?

2. Take a 2 x 2 factorial ANOVA as an example: how can one check the normality assumption for the residuals? Should we take the residuals from each cell (to comply with Y|X) or the overall residuals regardless of the factors? SPSS allows us to do both (via the Factor field in the Explore menu).

I have a hunch that we have to generate/calculate residuals manually before doing the normality test, but still unsure about it.

rb says

November 10, 2013 at 4:32 pm

“You have to take out the effects of all the Xs before you look at the distribution of Y. As it turns out, the distribution of Y|X is, by definition, the same as the distribution of the residuals.”
This is a bit confusing. How do you take “out” the effects of all the Xs in the context above? I just wanted to know the mechanism; an example with some data points would definitely help. Also, what leads us to believe that the distribution of Y is the same as the distribution of the residuals?

- rb says
  
  November 10, 2013 at 4:33 pm
  
  Sorry, I meant to say, what leads us to believe that the distribution of Y|X is the same as the distribution of the residuals?
  
  - Karen says
    
    November 11, 2013 at 3:16 pm
    
    Because all X’s are assumed fixed. In other words, they are assumed to have no random error. So when you add the Xs to the residuals, you’re just adding constants (at least theoretically).
    
    It’s very hard to think about without writing out the equations, but that’s the gist of it. The easiest example would be one in which there was only one X, which had only two values.
    
    I may have to write out a separate blog post, with pictures, to show you.
    
Joram says

September 18, 2013 at 3:48 pm

Given any dependent variable, how would you choose which transformation (if any) of this variable to use in the regression, with regard to the normality assumption? It sounds from this brief explanation like there is no way to do that.

- Karen says
  
  September 25, 2013 at 10:27 am
  
  Hi Joram, there is. Sometimes you can do it with logic. For example, for a right-skewed distribution I need a function that will affect high numbers more than low numbers. Logs and square roots both do that. Another option is to use the Box-Cox transformation, which will give you an idea of the most effective power transformation.
  
Nur Barizah says

July 20, 2012 at 12:50 am

Thanks for the notes up there. You are right, many researchers (including me) drive ourselves crazy by trying to test normality & other assumptions on the DV. But that’s what we have been taught by either our stats teachers or stats books. Thanks for the enlightenment 🙂

- Karen says
  
  August 3, 2012 at 2:44 pm
  
  Hi Nur,
  
  You’re welcome. I used to teach stats, and sometimes there are just too many new concepts you’re throwing at students to really clarify the difference. So I’m sure at the time, that was the best way to teach it. But now you’re sophisticated enough to stop driving yourself crazy. 🙂
  
  Karen
  
Stat Rules says

July 16, 2010 at 1:28 pm

What do you do if a review of the residuals reveals non-normal characteristics? For example, I’m working on a General Linear Model analysis (dependent variables are nominal and random) and the distribution of residuals looks sigmoidal (like an S). The residual vs. observation order plot shows a few spikes.

- Karen says
  
  July 16, 2010 at 1:50 pm
  
  Hi Sonia,
  
  The dependent variable is nominal? You need a logistic regression then instead of a GLM. The sigmoidal residuals are exactly what happens. Here is another article that might clarify: When Dependent Variables Are Not Fit for GLM, Now What?
  
  Is there actually an order to the observations? Unless you have time-series or spatial data, there usually isn’t. Those are the situations where autocorrelation comes in.
  
Joe King says

June 3, 2009 at 1:10 am

I have a question about the assumptions of linear regression and ANOVA – what are the differences in the assumptions behind these models?

- admin says
  
  June 3, 2009 at 10:40 pm
  
  Hi Joe–there are no differences in the assumptions. ANOVA and Regression are really just two forms of the same theoretical model.
  
  Now since the assumptions are about Y given X (Y|X), and the X’s usually have a different form in the two models, they do manifest slightly differently. For example, if you look at two very simple models–a one-way ANOVA and a simple regression with a single continuous predictor–the X is categorical in the former and continuous in the latter.
  
  That means that in the ANOVA, the assumptions about Y|X being independent, with a normal distribution and constant variance, apply to the values of Y within each level of X.
  
  In the regression, since X is continuous, it’s hard to look at the distribution of Y at EACH value of X (it’s impossible, actually, theoretically). So although the assumption is the same, it’s easier to check it by looking at the residuals, which have the same distribution as Y|X.
  
  - stan says
    
    August 23, 2015 at 5:00 am
    
    Sorry if this is a stupid question: does analyzing the distribution of Y|X in the context of a linear model mean that we need to check the _residuals_ of Y at each level of our factor?
    