The Analysis Factor

Statistical Consulting, Resources, and Statistics Workshops for Researchers


Checking Assumptions in ANOVA and Linear Regression Models: The Distribution of Dependent Variables

by Karen Grace-Martin

Here’s a little reminder for those of you checking assumptions in regression and ANOVA:

The assumptions of normality and homogeneity of variance for linear models are not about Y, the dependent variable. (If you think I'm either stupid, crazy, or just plain nit-picking, read on. This distinction really is important.)

The distributional assumptions for linear regression and ANOVA are for the distribution of Y|X — that’s Y given X.  You have to take out the effects of all the Xs before you look at the distribution of Y.  As it turns out, the distribution of Y|X is, by definition, the same as the distribution of the residuals.  So the easiest way to check the distribution of Y|X is to save your residuals and check their distribution.

I’ve seen too many researchers drive themselves crazy trying to transform skewed Y distributions before they’ve even run the model.  The distribution of the dependent variable can tell you what the distribution of the residuals is not—you just can’t get normal residuals from a binary dependent variable.

But it cannot always tell you what the distribution of the residuals is.

If a categorical independent variable has a big effect, the dependent variable can have a continuous, bimodal distribution. But the residuals (equivalently, the distribution of Y within each category of the independent variable) can still be normally distributed.
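A hedged simulation of that point (standard-library Python; the group means and noise level are invented for illustration): a two-level categorical X with a big effect makes Y bimodal, yet the residuals, i.e. the deviations from each group's mean, stay centered at zero with only the within-group spread.

```python
import random
import statistics

random.seed(1)

# A categorical X with a big effect: two groups with very different means.
group_means = {"control": 10.0, "treatment": 30.0}
data = [(g, random.gauss(m, 2.0)) for g, m in group_means.items() for _ in range(200)]

# "Taking out" a categorical X just means subtracting the group mean.
fitted = {g: statistics.mean(y for gg, y in data if gg == g) for g in group_means}
residuals = [y - fitted[g] for g, y in data]

y_values = [y for _, y in data]
print(round(statistics.stdev(y_values), 1))   # roughly 10: dominated by the 20-point group gap
print(round(statistics.stdev(residuals), 1))  # roughly 2: just the within-group noise
```

Here y_values is clearly bimodal (two clumps around 10 and 30), while the residuals are a single normal-looking clump around zero.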

And what are those distributional assumptions of Y|X?

1. Independence

2. Normality

3. Constant Variance

You can check all three with a few residual plots: a Q-Q plot of the residuals checks normality (2), and a scatter plot of the residuals against X, or against the predicted values of Y, checks independence (1) and constant variance (3).
________________________________________________________

Learn more about each of the assumptions of linear models (regression and ANOVA), and why they make sense, in our new On Demand workshop: Assumptions of Linear Models.



Tagged With: ANOVA, distribution, linear regression, Model Assumptions, normality, Q-Q plot, residuals

Related Posts

  • Member Training: The Multi-Faceted World of Residuals
  • Can Likert Scale Data ever be Continuous?
  • Same Statistical Models, Different (and Confusing) Output Terms
  • Why ANOVA is Really a Linear Regression, Despite the Difference in Notation

Reader Interactions

Comments

  1. Loretta Rafay says

    April 3, 2018 at 7:25 pm

    This explanation is super helpful, thank you!

  2. Andre says

    March 28, 2016 at 9:01 pm

I have three treatments and 2 timepoints. I have performed a mixed model and saved the residuals. When testing for normality (using the Explore command), should I include treatment in the factor list in order to make Q-Q plots for each group, or analyze it all as one?

And what about time?

  3. Mano says

    December 25, 2014 at 4:20 pm

    Hello Karen,

    When you say account for all Xs, do we also include the control variables (in addition to the predictors)?

    Thank you!

    • Karen says

      December 29, 2014 at 2:04 pm

      Yes. ALL Xs, both control and predictors.

  4. Bianca says

    November 24, 2014 at 10:01 am

    Hi Karen,

Thank you very much, just to know that I've got the concepts right is a relief actually!!! I'll keep looking =)

  5. Bianca says

    November 22, 2014 at 4:57 pm

    Hi Karen,

I'm struggling to find any information about how to test the assumptions for a Type II regression (MA). If I understood correctly, you check for normality by analyzing the residuals of Y, because you assume that X has no random error, which is appropriate for simple linear regression (OLS). However, when performing a Type II regression, we assume that X also has an associated error... so how can I test the assumptions (and are they the same) in those MA regressions?

Thank you very much!

PS: Sorry about my English, I'm Brazilian =)

    • Karen says

      November 23, 2014 at 1:03 pm

      Hi Bianca,

      This is a great question, but I don’t have the answer. Hopefully another reader can comment. I know Type II regression well enough to say you’ve got the concepts right and I agree it makes sense that Xs also have associated error, but I can’t verify it.

  6. Oriole says

    October 18, 2014 at 11:08 pm

    Hi Karen,
So what it means is to check the assumptions by using the residuals generated from the model instead of the dependent variable itself? If I am running a linear mixed model in SPSS, is there any way to check homogeneity of variance (it is not set as a default, as it is in Univariate)? And should Levene's test be used on the residuals to check for homogeneity of variance?

    • Karen says

      October 20, 2014 at 9:25 am

      Hi Oriole,

      Yes, exactly. Save the residuals and do your assumption checks on them, not Y.

A linear mixed model in SPSS can save the residuals, and then you do everything the same as you would in any linear model for checking assumptions. I don't use Levene's test as a general rule for homogeneity of variance, as it is unreliable.

  7. heather says

    September 26, 2014 at 12:07 am

    Hi anyone,

I am not sure why the assumptions of ANOVA and linear regression are the same. Can anyone explain this in detail?
Normality, equal variances, and independence

  8. Arran Davis says

    April 10, 2014 at 3:54 pm

Hi Karen – thanks for the article. This can be a confusing topic. Say I have a categorical variable with three levels (e.g. country) and I am using it to predict income. After using a general linear model to get residuals, I check whether they are normally distributed using a Shapiro-Wilk test. As a whole the residuals are normally distributed, but when I break the residuals down into the levels of the category (the residuals of the predictions for each country), only two of the three countries have normally distributed residuals. Does this mean that the assumption of normally distributed residuals has been broken? Or is it okay, since the overall residuals of the model are normal?

  9. hamzah says

    February 23, 2014 at 8:16 am

    Hi Karen,
1. Do all statistical packages (e.g. SPSS) also assume this residual consideration for their normality checks? I mean, when we enter DV raw scores in the Explore menu for a normality test, does SPSS's algorithm intelligently compute and use residuals to return the normality test?

2. Suppose we have a 2 x 2 factorial ANOVA as an example. How can one check the normality assumption for the residuals? Should we take the residuals from each cell (to comply with Y|X) or the overall residuals, regardless of the factors? SPSS allows us to do both (via the Factor field in the Explore menu).

I have a hunch that we have to generate/calculate the residuals manually before doing the normality test, but I am still unsure about it.

  10. rb says

    November 10, 2013 at 4:32 pm

    “You have to take out the effects of all the Xs before you look at the distribution of Y. As it turns out, the distribution of Y|X is, by definition, the same as the distribution of the residuals.”
This is a bit confusing. How do you take "out" the effects of all the Xs in the context above? I just wanted to know the mechanism; an example with some data points would definitely help. Also, what leads us to believe that the distribution of Y is the same as the distribution of the residuals?

    • rb says

      November 10, 2013 at 4:33 pm

      Sorry, I mean to say, what leads us to believe that the distribution of Y|X is same as the distribution of the residuals?

      • Karen says

        November 11, 2013 at 3:16 pm

        Because all X’s are assumed fixed. In other words, they are assumed to have no random error. So when you add the Xs to the residuals, you’re just adding constants (at least theoretically).

        It’s very hard to think about without writing out the equations, but that’s the gist of it. The easiest example would be one in which there was only one X, which had only two values.

        I may have to write out a separate blog post, with pictures, to show you.

  11. Joram says

    September 18, 2013 at 3:48 pm

Given any dependent variable, how would you choose which transformation (if any) of the variable to regress, with regard to the normality assumption? It sounds from this brief explanation as if there is no way to do that.

    • Karen says

      September 25, 2013 at 10:27 am

Hi Joram, there is. Sometimes you can do it with logic. E.g., for a right-skewed distribution I need a function that will affect high numbers more than low numbers. Logs and square roots both do that. Another option is to use the Box-Cox transformation, which will give you an idea of the most effective power transformations.
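As an illustration of the Box-Cox idea, here is a standard-library Python sketch on simulated data. (The real Box-Cox procedure chooses lambda by maximum likelihood; using sample skewness here is a deliberate simplification, and all the numbers are made up.)

```python
import math
import random
import statistics

random.seed(3)
# Right-skewed positive data: exp of a normal, i.e. lognormal.
y = [math.exp(random.gauss(0, 0.8)) for _ in range(500)]

def boxcox(v, lam):
    # The Box-Cox power transform; lam = 0 is defined as the log.
    return math.log(v) if lam == 0 else (v ** lam - 1) / lam

def skewness(values):
    # Sample skewness: standardized third moment.
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return statistics.mean((v - m) ** 3 for v in values) / s ** 3

# Pick the power that best symmetrizes the data.
lambdas = [-1, -0.5, 0, 0.5, 1, 2]
best = min(lambdas, key=lambda lam: abs(skewness([boxcox(v, lam) for v in y])))
print(best)  # 0: the log transform wins for lognormal data
```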

  12. Nur Barizah says

    July 20, 2012 at 12:50 am

Thanks for the notes up there. You are right, many researchers (including me) drive ourselves crazy trying to test normality and other assumptions on the DV. But that's what we have been taught by either our stats teachers or stats books. Thanks for the enlightenment 🙂

    • Karen says

      August 3, 2012 at 2:44 pm

Hi Nur,

      You’re welcome. I used to teach stats, and sometimes there are just too many new concepts you’re throwing at students to really clarify the difference. So I’m sure at the time, that was the best way to teach it. But now you’re sophisticated enough to stop driving yourself crazy. 🙂

      Karen

  13. Stat Rules says

    July 16, 2010 at 1:28 pm

What do you do if a review of the residuals reveals non-normal characteristics? For example, I'm working on a general linear model analysis (the dependent variables are nominal and random) and the distribution of residuals looks sigmoidal (like an S). The residuals vs. observation order plot shows a few spikes.

    • Karen says

      July 16, 2010 at 1:50 pm

      Hi Sonia,

      The dependent variable is nominal? You need a logistic regression then instead of a GLM. The sigmoidal residuals are exactly what happens. Here is another article that might clarify: When Dependent Variables Are Not Fit for GLM, Now What?

      Is there actually an order to the observations? Unless you have time-series or spatial data, there usually isn’t. Those are the situations where autocorrelation comes in.

  14. Joe King says

    June 3, 2009 at 1:10 am

    I have a question about the assumptions of linear regression and ANOVA – what are the differences in the assumptions behind these models ?

    • admin says

      June 3, 2009 at 10:40 pm

      Hi Joe–there are no differences in the assumptions. ANOVA and Regression are really just two forms of the same theoretical model.

Now since the assumptions are about Y given X (Y|X), and the X's usually have a different form in the two models, they do manifest slightly differently. For example, if you look at two very simple models, a one-way ANOVA and a simple regression with a single continuous predictor, the X is categorical in the former and continuous in the latter.

That means that in the ANOVA, the assumptions about Y|X being independent with a normal distribution and constant variance apply to the values of Y within each level of X.

      In the regression, since X is continuous, it’s hard to look at the distribution of Y at EACH value of X (it’s impossible, actually, theoretically). So although the assumption is the same, it’s easier to check it by looking at the residuals, which have the same distribution as Y|X.

      • stan says

        August 23, 2015 at 5:00 am

Sorry if this is a stupid question: does analyzing the distribution of Y|X in the context of a linear model mean that we need to check the _residuals_ of Y at each level of our factor?



Copyright © 2008–2021 The Analysis Factor, LLC. All rights reserved.