The Analysis Factor

Statistical Consulting, Resources, and Statistics Workshops for Researchers

Why ANOVA and Linear Regression are the Same Analysis

by Karen Grace-Martin

If your graduate statistical training was anything like mine, you learned ANOVA in one class and Linear Regression in another.  My professors would often say things like “ANOVA is just a special case of Regression,” but give vague answers when pressed.

It was not until I started consulting that I realized how closely related ANOVA and regression are. They're not only related, they're the same thing. Not a quarter and a nickel, but different sides of the same coin.

So here is a very simple example that shows why. When someone showed me this, a light bulb went on, even though I already knew both ANOVA and multiple linear regression quite well (and already had my master's in statistics!). I believe that understanding this little concept has been key to my understanding of the general linear model as a whole; its applications are far-reaching.

Use a model with a single categorical independent variable, employment category, with 3 categories: managerial, clerical, and custodial. The dependent variable is Previous Experience in months. (This data set is employment.sav, one of the data sets that come free with SPSS.)

We can run this as either an ANOVA or a regression.

In the ANOVA, the categorical variable is effect coded. This means that the categories are coded with 1s and −1s so that each category's mean is compared to the grand mean.

In the regression, the categorical variable is dummy coded**, which means that each category's intercept is compared to the reference group's intercept. Since the intercept is defined as the mean value when all other predictors = 0, and there are no other predictors, the three intercepts are just means.

In both analyses, Job Category has an F=69.192, with a p < .001.  Highly significant.
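
If you'd like to verify this equivalence outside SPSS, here is a minimal sketch in Python (numpy only). The data below are simulated stand-ins for the employment data, so the printed F will not be 69.192, but the two ways of computing it return the identical number:

    import numpy as np

    # Simulated stand-in for the employment data: previous experience (months)
    # in three job categories. Group sizes and means here are illustrative.
    rng = np.random.default_rng(0)
    groups = {
        "manager":   rng.normal(78, 30, 84),
        "clerical":  rng.normal(85, 30, 363),
        "custodial": rng.normal(298, 30, 27),
    }
    y = np.concatenate(list(groups.values()))
    labels = np.concatenate([[name] * len(v) for name, v in groups.items()])
    n, k = len(y), len(groups)

    # One-way ANOVA: F from the between- and within-group sums of squares.
    grand_mean = y.mean()
    ss_between = sum(len(v) * (v.mean() - grand_mean) ** 2 for v in groups.values())
    ss_within = sum(((v - v.mean()) ** 2).sum() for v in groups.values())
    f_anova = (ss_between / (k - 1)) / (ss_within / (n - k))

    # Regression: dummy code the category (manager = reference) and fit by OLS.
    X = np.column_stack([
        np.ones(n),                             # intercept
        (labels == "clerical").astype(float),   # dummy variable 1
        (labels == "custodial").astype(float),  # dummy variable 2
    ])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    ss_model = ((fitted - grand_mean) ** 2).sum()
    ss_resid = ((y - fitted) ** 2).sum()
    f_reg = (ss_model / (k - 1)) / (ss_resid / (n - k))

    print(f_anova, f_reg)  # the same value, up to floating-point error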

In the ANOVA, we find the means of the three groups are:

Clerical:      85.039

Custodial: 298.111

Manager:   77.619

In the Regression, we find these coefficients:

Intercept:    77.619

Clerical:         7.420

Custodial: 220.492

The intercept is simply the mean of the reference group, Managers. The coefficients for the other two groups are the differences between each group's mean and the reference group's mean.

You’ll notice, for example, that the regression coefficient for Clerical is the difference between the mean for Clerical, 85.039, and the Intercept, or mean for Manager (85.039 – 77.619 = 7.420).  The same works for Custodial.
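
Continuing the numpy sketch above (with its simulated numbers rather than the ones printed here), each of these checks prints True:

    # The intercept is the reference group's mean; each dummy coefficient is
    # the difference between that group's mean and the reference mean.
    print(np.isclose(beta[0], groups["manager"].mean()))
    print(np.isclose(beta[1], groups["clerical"].mean() - beta[0]))
    print(np.isclose(beta[2], groups["custodial"].mean() - beta[0]))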

So an ANOVA reports each mean and a p-value that says at least two are significantly different. A regression reports only one mean (as an intercept), and the differences between that one and all other means, but the p-values evaluate those specific comparisons.

It’s all the same model; the same information but presented in different ways.  Understand what the model tells you in each way, and you are empowered.

I suggest you try this little exercise with any data set, then add in a second categorical variable, first without, then with an interaction.  Go through the means and the regression coefficients and see how they add up.
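
For the interaction step of that exercise, here is one way to continue the numpy sketch above, with a made-up second factor; the interaction columns are just products of the two factors' dummy variables:

    # A made-up second categorical variable, added to the earlier sketch.
    gender = rng.choice(["female", "male"], size=n)

    d_cler = (labels == "clerical").astype(float)
    d_cust = (labels == "custodial").astype(float)
    d_male = (gender == "male").astype(float)

    # Main effects plus interaction: each interaction column is simply the
    # product of one job-category dummy and the gender dummy.
    X2 = np.column_stack([
        np.ones(n), d_cler, d_cust, d_male,
        d_cler * d_male, d_cust * d_male,
    ])
    beta2, *_ = np.linalg.lstsq(X2, y, rcond=None)
    # beta2 reproduces the six cell means as the reference cell's mean
    # plus the relevant main-effect and interaction coefficients.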

**The dummy coding creates two 1/0 variables: Clerical = 1 for the clerical category, 0 otherwise; Custodial = 1 for the custodial category, 0 otherwise.  Observations in the Managerial category have a 0 value on both of these variables, and this is known as the reference group.
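
For comparison, here is the same kind of sketch for effect coding, the parameterization the ANOVA output uses. The group means below are made up for illustration; the point is that the reference category is coded −1 on both columns, so the intercept comes out as the unweighted average of the three group means and each coefficient is a deviation from that average.

    import numpy as np

    # Effect coding for three categories; manager is coded -1 on both columns.
    # These three group means are made up for illustration.
    group_means = np.array([77.6, 85.0, 298.1])  # manager, clerical, custodial
    X_eff = np.array([
        [1.0, -1.0, -1.0],  # manager
        [1.0,  1.0,  0.0],  # clerical
        [1.0,  0.0,  1.0],  # custodial
    ])
    b = np.linalg.solve(X_eff, group_means)
    print(np.isclose(b[0], group_means.mean()))  # intercept = unweighted grand mean
    print(b[1], group_means[1] - b[0])           # clerical's deviation from it
    print(b[2], group_means[2] - b[0])           # custodial's deviation from it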

Tagged With: analysis of covariance, analysis of variance, ancova, ANOVA, dummy coding, effect coding, linear regression

Related Posts

  • 3 Reasons Psychology Researchers should Learn Regression
  • SPSS GLM: Choosing Fixed Factors and Covariates
  • The General Linear Model, Analysis of Covariance, and How ANOVA and Linear Regression Really are the Same Model Wearing Different Clothes
  • Why ANOVA is Really a Linear Regression, Despite the Difference in Notation

Comments

  1. Andrew says

    October 26, 2020 at 9:21 am

    Why do researchers prefer ANOVA instead of regression? I understand that they are essentially the same.

    • Karen Grace-Martin says

      November 2, 2020 at 10:07 am

      Hi Andrew,

      In experimental designs with many interactions, ANOVA best practices make the output easier to interpret. You’re still running the same underlying model, but you’re focusing on different ways of presenting the estimates of that model in order to answer the research questions.

  2. Franziska says

    June 23, 2019 at 4:26 pm

    Hi Karen,

    In my analysis, the ANOVA (or rather, its post hoc tests) and the regression differ in significance. I only have dummy variables for one treatment (for the regression I enter four of the five in the estimation). I get the exact same effect sizes, i.e. the mean difference in the post hoc test equals the beta in the regression, BUT the coefficient is only significant in the regression, not in the post hoc test. Can you please help me figure out why?

    Thanks and regards,
    Franziska

    • Karen Grace-Martin says

      August 22, 2019 at 1:44 pm

      Franziska,

      The post hoc tests and the regression coefficients are usually doing slightly different mean comparisons.

  3. Geoff Stuart says

    December 18, 2018 at 8:11 pm

    This is not strictly true when there are more than two factors, as explained here:
    https://www.cscu.cornell.edu/news/statnews/stnews13.pdf

    An important difference is how the F-ratios are formed. In ANOVA the variance due to all other factors is subtracted from the residual variance, so it is equivalent to full partial correlation analysis. Regression is based on semi-partial correlation, the amount of the total variance accounted for by a predictor.

    Of course it is possible to run a multi-way ANOVA using a regression program, but the standard approach will not give the same results.

    • Karen Grace-Martin says

      March 4, 2019 at 11:47 am

      Hi Geoff,

      Yes, there are often differences in software defaults, but they can be changed. The underlying models are the same, though.

  4. Matt says

    September 10, 2018 at 4:35 pm

    I have a single continuous dependent variable (marks) and four predictors in Likert scale format. How do I build a multiple regression model?

  5. Jennifer Leigh Faulkner says

    August 1, 2018 at 12:32 pm

    I wanted to follow along by using the same data set to do an ANOVA and regression like you did. I did the ANOVA and got the same means. However, I must not have done the regression correctly because I got b0=83.901, b1=8.474, and Beta1=.063. Was I supposed to put Employment Category into Block 1 of 1? Thank you in advance.

    • Karen Grace-Martin says

      August 17, 2018 at 1:52 pm

      Jennifer, it’s hard for me to figure out what went wrong without seeing it. If you post the syntax from the model you ran, I might be able to help.

      But perhaps you didn’t dummy-code the Employment Category variable first? You can’t just put it in the model without dummy coding.

  6. Matt M says

    June 8, 2018 at 8:22 pm

    I have been trying to put this into words for some time now. I suppose I should have just run the two tests and compared the results as you did. Either way, excellent work. Very well explained.

  7. Mees says

    February 19, 2018 at 6:21 am

    Dear all,

    I want to research the effect of gender (male/female) and ethnicity (black/white/Asian) on the source of funding (bank/own/VC/family). As all variables are non-numeric, how can I perform a regression?

    • Karen Grace-Martin says

      May 17, 2018 at 10:03 am

      Mees, when the *dependent* variable is categorical, you’ll need to do a logistic regression. See this page for resources on it: https://www.theanalysisfactor.com/resources/by-topic/logistic-regression/

  8. Karthik says

    January 3, 2017 at 5:26 am

    Hi guys,

    I have a salary dataset that contains skills as a single column (independent variable) and Salary as the dependent variable. I then split the skills column into multiple skill columns coded for presence (1) or absence (0).
    e.g.:

    emp_id  skills               Salary
    1       R,python,excel,word  4000

    I made the dataset transition like this:

    emp_id  R  Python  Excel  Word  Java  Salary
    1       1  1       1      1     0     4000

    Then I performed multiple linear regression to find out the skills influencing salary most. I have the summary of results.

    My question is: is that the only analysis we can do, or what other alternative analyses can we do to predict the salary?

  9. Kristyn says

    November 11, 2016 at 4:08 pm

    I have 2 continuous IVs and 2 categorical IVs. I want to run a regression because I want the beta coefficients for the continuous variables, but I also want to run an ANOVA so that I can look at pairwise comparisons for the categorical IVs. Is it ok to run both and use results from both in my reporting?

  10. Faiza says

    October 17, 2016 at 2:09 pm

    Thank you for the post.
    Could you help me, please?
    For my PhD, what should I examine: (1) the differences between four groups, using one-way ANOVA, or (2) the relationships between variables, using regression?

  11. Thom says

    July 28, 2016 at 1:14 pm

    Can anybody tell me if you can use AIC to compare several one- and two-way ANOVAs the way you compare linear models? Thanks.

  12. Oliver says

    July 9, 2016 at 6:01 pm

    I just read “NOT-Statistik” by Barbara Bredner and she discusses this on page 9. She says that the distinction grew historically. The equality of the models is said to be described by Rutherford (2000): Rutherford, Andrew (2000). Introducing Anova and Ancova: A GLM Approach (Introducing Statistical Methods). Sage Publications Inc. ISBN: 9780761951612. (https://uk.sagepub.com/en-gb/eur/introducing-anova-and-ancova/book205673)

    Excerpt: “Eine Aufteilung in ANOVA-, Regressions- und Kovarianz-Modell ist aus der Sicht des Anwenders rein von akademischen Interesse, da ANOVA- und Regressions-Modelle Spezialfälle der Kovarianzmodelle sind”. This translates roughly to: “From the user's point of view, the division into ANOVA, regression, and covariance models is of purely academic interest, since ANOVA and regression models are special cases of the covariance models.” She also says that “linear model” is a synonym for ANCOVA.

  13. Marina Damasceno Pereira says

    May 22, 2016 at 1:31 pm

    This text was super enlightening, thanks!!!

    But I still have one doubt: what if your DV has more than 2 categories, not only 1/0? How would I “categorize” my DV? Something like positive, indifferent, or negative as possible outcomes. Would the “dummy” thing also apply here? Like negative = 0, positive = 1, and indifferent = 2?

    My DV is categorical and my 3 IVs are continuous.

    And does it really not matter whether I use ANOVA or regression?

    • Brijender Singh says

      July 11, 2016 at 3:53 am

      The number of dummy variables is always (n−1), where n is the total number of categories.

      For example: if you have three categories, +ve, −ve, and indifferent, the number of dummy variables required is 3 − 1 = 2.

      How it works:

      Response      +ve (dummy)   −ve (dummy)
      positive      1             0
      negative      0             1
      indifferent   0             0

      So if you make +ve and −ve dummy variables, then indifferent becomes your reference category. When the response is positive, the result is 1 for +ve and 0 for −ve. When the response is negative, the result is 0 for +ve and 1 for −ve. When the response is neither positive nor negative, the result is 0 for both, and that pattern identifies indifferent.

      I hope this makes sense!

  14. Kieran says

    January 19, 2016 at 7:33 am

    This article was really eye-opening; I can't believe I've never seen it like this before!

    A really quick question regarding intercepts and means. We have multiple technicians repeatedly testing different materials, and we have used a mixed model (PROC MIXED) with technicians as a fixed factor. For most materials the intercept is very close to the calculated grand mean, but for one, the intercept is about 1/3 of the grand mean.

    I was just wondering what could cause the intercept to be so far from the calculated mean.

    We were hoping to use intercept and variance to calculate confidence intervals.

    Any guidance would be much appreciated!!

    Kieran

  15. Sri says

    December 2, 2015 at 1:35 am

    Thanks Karen, this is a wonderful explanation. The concepts are getting clearer to me.

    Best Regards,

  16. Lisa C. Jenkins says

    August 22, 2015 at 12:12 pm

    This post is a God-send … a life-saver … now I can complete and defend my dissertation before September 20th!!! 🙂

  17. Ricardo says

    June 9, 2015 at 7:30 pm

    Hello Karen,

    I already knew how to match the results of a multiple linear regression with ANOVA. But now I'm trying to run a 3×3 factorial design, and I want to know if each factor has a significant quadratic effect. I used SPSS MANOVA with a contrast matrix to obtain the quadratic effect of each factor, and in the linear regression I created a new variable multiplying each factor by itself. Anyway, the F-test and the p-value are different. Any ideas?

    Thanks,

    • Karen says

      June 23, 2015 at 12:28 pm

      Are you treating those factors as categorical in both models? Most regression procedures can’t accommodate factors without you explicitly dummy coding them.

  18. abdul says

    April 2, 2015 at 2:44 am

    Sir, kindly guide me about MANOVA and multiple regression. Are they the same or not?

  19. Jaber Aberburg says

    February 5, 2015 at 10:51 am

    Hi Karen,

    I was wondering, in your example, what if this is not the case: “Since the intercept is defined as the mean value when all other predictors = 0, and there are no other predictors, the three intercepts are just means.” …but rather that there are let’s say two other predictors. Would then the intercept for a category be an “adjusted” mean value for that category?

    Thank you!

    • Karen says

      February 6, 2015 at 5:04 pm

      Hi Jaber,

      Yes, if there are predictors, then the intercept is the mean of Y conditional on all X = 0. Not every X will have values in the data set that equal 0, so it's not always meaningful. But if they do, then yes.

  20. Wander says

    December 12, 2014 at 2:47 am

    Good post. One can use effect coding for regression.

  21. sdcandan says

    October 24, 2014 at 2:56 am

    Yes, Karen,
    you are right: I didn't dummy code the categorical predictor when putting it into the regression model. I realised it after I posted my question.
    I understand that there are other coding schemes (effect coding, etc.) for categorical predictors, each leading to different regression coefficients. That is another topic I need to investigate. I guess these are closely related to contrasts.

  22. Sadik says

    October 18, 2014 at 12:00 pm

    Hi Karen
    First of all, I have been following your site and found it very informative, so I must thank you for that.
    Secondly, I was investigating the same issue, i.e. ANOVA vs. regression. Although I have seen many internet resources claiming they are the same, I wanted to make sure and therefore tried the data in your post. But I couldn't replicate your results. I guess you did a one-way ANOVA and a univariate model fit in SPSS, rather than a one-way ANOVA and a linear regression, because when I fit a linear regression in SPSS, I get 83.901 as the intercept and 8.474 as the slope. The ANOVA tables were different too.
    So I am confused.

    • Karen says

      October 20, 2014 at 9:22 am

      Hi Sadik,

      I’m guessing that you didn’t dummy code the Job Category variable, since you have only one slope coefficient. If you don’t, SPSS will read your three category codes as true numbers.

      And yes, Univariate GLM will do this for you, but the Linear Regression procedure will not. But if you code it correctly, you'll get identical results.

  23. Tom says

    October 1, 2014 at 4:14 am

    Nice article. Maybe it would be more precise to say that ANOVA is a special case of regression. Because regression can do more than just this ANOVA model.

    • Karen says

      October 20, 2014 at 9:28 am

      Hi Tom,

      It may be more precise, indeed. However, when I was a grad student I kept hearing this and found it quite unhelpful. I find thinking of it this way makes more sense. YMMV.

  24. heather says

    September 26, 2014 at 12:01 am

    I have a question about ANOVA and regression models:
    why can equal-means and separate-means models be compared when they are not nested?
    Thanks

  25. Ayana says

    September 20, 2014 at 10:46 am

    This really helps to clarify things. Thanks!

  26. Eduardo R. Cunha says

    September 8, 2014 at 6:54 pm

    Hi guys,
    Reading this post I was curious about any post hoc tests that could be employed to test differences between every pair of treatments, just like the common post hoc comparisons for ANOVA. As Karen has already highlighted, post hoc comparisons are different from planned comparisons, so I think any kind of contrast would not be appropriate. So, does such a post hoc test for dummy-variable regression exist?

    Eduardo

    • Wander says

      December 12, 2014 at 2:10 am

      Eduardo,

      Yes, regression can do the same work. Indeed, multiple comparisons are not even directly related to ANOVA. You need to adjust p-values for multiple comparisons because you conduct multiple independent t-tests. You don't actually need to conduct an ANOVA if your purpose is multiple comparisons.

  27. Dean says

    August 19, 2014 at 11:57 am

    Hi Karen. As a FORMER Six Sigma Quality Black Belt it has been a while since I have done an in-depth study, but I recently ran a simple 2×2 DOE that resulted in an R-squared of 84.78%, but when I ran a regression analysis (both in Minitab 15) I only get an R-squared of 62.6%. Can you help me understand why there would be that big a difference? Thanks!

    • Karen says

      August 19, 2014 at 5:09 pm

      Hi Dean,

      There shouldn’t be. It must be differences in defaults between the two procedures. For example, one is assuming the variables are continuous and the other categorical. Or one defaults to a main-effects-only model whereas the other includes an interaction.

      • Dean says

        August 20, 2014 at 7:26 am

        Hi Karen, no, neither of those scenarios. The data is somewhat non-normal. Could that account for it, or should they have the EXACT same R-squared regardless? Thanks again!

        • Dean says

          August 20, 2014 at 8:01 am

          Hi, me again. I take it back: my regression equation does not include interactions (that is, my equation only shows a constant and the main effects). I ASSUMED that MINITAB automatically accounted for that. I can't find an option to include interactions (maybe it's there but I don't see it; are you familiar with MINITAB, and if so, how do I include interactions in the regression analysis?).

          • Karen says

            August 25, 2014 at 3:58 pm

            Hi Dean,

            I have used Minitab, but I don’t currently have it and don’t remember the defaults. However, in most statistical software, the only way to include an interaction in a linear regression procedure is to create an interaction variable.

            So literally, if you want an interaction term for X*Z, create a new variable that is the product of X and Z.

            Then add it to your linear regression.

            Karen
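
            A minimal numpy sketch of that product-term step, with hypothetical predictors X and Z:

                import numpy as np

                # Hypothetical predictors; the interaction term is their product.
                X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
                Z = np.array([0.0, 1.0, 0.0, 1.0, 0.0])
                XZ = X * Z  # the new interaction variable

                # Add the product as one more column in the regression design.
                design = np.column_stack([np.ones(len(X)), X, Z, XZ])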

  28. Peter says

    August 8, 2014 at 8:57 am

    How would you explain the difference between a 2-way ANOVA when both ways are categorical, but one is defined by a measurable variable (e.g., temperature at which a strain of fruit fly is raised), and a linear regression with a dummy for the other variable? In particular, the ANOVA has an interaction component, while the regression doesn’t.

    • Karen says

      August 25, 2014 at 4:01 pm

      Hi Peter,

      Sure, but those are only the defaults. You don't have to include an interaction in an ANOVA, and you can always include an interaction in a linear model.

      But these defaults reflect the way people usually use ANOVA and Regression, so sometimes the software makes changing the defaults difficult. There are always ways around it though, and there are no theoretical reasons not to.

      Karen

  29. Lara says

    July 8, 2014 at 5:02 am

    Hello, thanks for the post. I don't know if you are still active in replying to comments, but I thought I might try, as I'm currently stuck in my Master's thesis analysis.

    After reading this I tend toward a regression, but maybe you can give me a tip.

    I have a categorical IV, 5 continuous DVs, and 5 control variables that are continuous as well. I have to check for moderation and mediation as well.

    In the beginning I thought I'd run some ANOVAs or a MANOVA to check if there are differences between the groups for the DVs. But now I'm struggling, as I don't know how to integrate the control variables.

    Is it correct that I have to run a multiple regression and recode my IV into a dummy?

    I'd be so thankful and happy to receive your answer.
    Thanks!

    • Karen says

      July 14, 2014 at 11:46 am

      Hi Lara,

      I do still answer questions when I can, but sometimes go through periods where I can’t. I do always answer questions right away in the Data Analysis Brown Bag or in Quick Question consulting, so if you ever need a quick answer, either will get you one.

      It depends which software you're using, but all should allow you to include those covariates in a MANOVA. I suppose technically at that point it would be called MANCOVA, but your software doesn't care what you're calling it. 🙂

  30. Conor says

    January 24, 2014 at 11:56 am

    Hi Karen, I just read your post and found it incredibly interesting. It did, however, cause me to worry very much about my current thesis and the analysis I should use for it.
    I am looking to see if employees' organisational history (such as voluntary/involuntary turnover) and their contract type (part- or full-time) impact their organisational commitment and empowerment.
    So I have 2 Dependent Variables (Commitment Questionnaire and Empowerment Questionnaire, both Likert scales)
    And I have 3 Independent Variables (occupational history (3 levels), contract type (2 levels), and I want to use gender as well, to see if this impacts the difference).

    I have attempted a two-way ANOVA, using only 2 IVs (occupational history and contract type) with both of the DVs, but both times my Levene's score was significant, and I could not find an answer as to what to do in this case. Does this mean I cannot use an ANOVA, as the assumptions are not met? And for that reason would I be better off using a regression with 3 IVs?

    • Karen says

      January 24, 2014 at 1:15 pm

      Hi Conor,

      First of all, Levene's test is not very reliable. (Geoffrey Keppel's ANOVA book discusses this.) I would suggest testing the homogeneity of variance assumption a few different ways.

      Even so, you’ll get exactly the same results if you run it in regression instead of ANOVA. They’re mathematically equivalent.

  31. John Meyers says

    October 29, 2013 at 9:58 am

    Interesting! One question though: when one does ANOVA, one is usually advised not to do basic t-tests between two individual categories, but instead to use post hoc tests that adjust alpha levels for multiple comparisons. When one does a t-test on a regression coefficient to see if it is significantly different from zero, does this issue not arise? I am wondering because, from your example, I understand that an individual coefficient is just the difference between the respective group and the comparison group. Thanks in advance for any pointers on where my logic is wrong.

    • Karen says

      October 29, 2013 at 10:31 am

      Hi John,

      That’s a great question. There are two reasons. One is not really about doing t-tests in the ANOVA, but doing all pairwise t-tests. The regression situation is identical to a set of orthogonal contrasts in ANOVA. Because they’re independent of each other, they don’t influence each other. If you kept changing the reference group in the regression in order to do more comparisons, then you’d need to do some adjustment.

      The other is the idea of post-hoc comparisons vs. planned comparisons. Whenever comparisons are made post-hoc (you’re just looking to see which means are different), you need the adjustment. That’s different from choosing which contrasts you want to make.

  32. Sam says

    September 13, 2013 at 3:00 pm

    I also just have a quick question. When dummy coding, the reference category stays at zero along with one of the categories. In this case there were two dummy variables, and Managers were coded zero along with another category on each dummy variable.

    I was just wondering how the regression analysis knows that the reference category is indeed the reference category, treats it as a constant, and differentiates it from the other category which is also coded as zero?

    I got the same results as you, so I don't think there was an issue with my dummy coding. I followed the recommendations of Field (2009).

    • Karen says

      September 25, 2013 at 10:23 am

      Hi Sam,

      It works because there are only two dummy variables for the three categories. Only individuals in the reference category have zeros on BOTH these variables. I find the way to really show this is to walk you through the equations, which I clearly can't do here (it takes me a good 1/2 hour to walk someone through it). But I did a webinar on it, if you're really interested in seeing how dummy coding works: Dummy Coding and Effect Coding in Regression Models. We also go through this in detail, with and without interactions, in my Interpreting (Even Tricky) Regression Coefficients workshop.

  33. Sam says

    September 13, 2013 at 2:09 pm

    Thanks for this post. I looked for the SPSS file entitled 'employment.sav' but it was not there. I did find one called 'Employee data.sav', though, which I believe is the same data you used (n=474). Just thought I would highlight this because anyone playing with this data may not otherwise be able to find it. I am using SPSS 20, so perhaps there have been changes to the names…

  34. Hongyuan He says

    August 24, 2013 at 7:53 am

    As a lowly Bachelor of engineering, here is why you are wrong:

    If we ignore rank-deficient cases, then linear regressions only have a non-zero residual if there are more data points than parameters (overdetermined). In contrast, the residual is zero if there are the same number of data points as parameters (critically determined). In this latter, degenerate case, the “regression” always produces a perfect fit.

    So, what happens when we apply ANOVA and regression to grouped data?

    – With ANOVA, variance is partitioned into between-level and within-level.

    – With critically determined linear regression:
    — The model can account for all of the between-level variance, because the number of *unique* predictors (i.e. number of levels) equals the number of parameters. Put another way, the model can pass through all group means.
    — The residual will be the within-level variance.

    – With overdetermined linear regression:
    — The model will only account for some of the between-level variance.
    — The residual will be the within-level variance, *plus* the remaining between-level variance.

    The regression you described is critically determined, because the number of (unique) input levels and parameters are both 2. The input levels are, e.g. Clerical and Non-Clerical; and the parameters are intercept and slope because you are fitting a line. We expect that this regression and ANOVA will partition your variance in the exact same way, and because they arrive at the same partition, they will have the same mean for the groups as you have observed.

    However, if ever you were to have a category with more levels (say, “Year”), then we are instead looking at an overdetermined linear regression. A line (or even a parabola) will now fail to model the entirety of the between-level variance, and the model estimates won’t correspond to any of the ANOVA means.

    In summary, we can only say that ANOVA produces equivalent results to linear regressions that are critically determined. You cannot claim that ANOVA is the same as linear regression. Not only is this claim wrong, it is wrong in a subtle enough way that it will condemn readers to many headaches before (and if) they ever claw back to the truth.

    I hope the stats book you published doesn’t spread the same misinformation. People learn from your writings and trust you as an expert on the subject matter, so please understand that I am just trying to help.

    • Karen says

      September 4, 2013 at 3:29 pm

      Hi Hongyuan,

      Thanks for your trying to help. Sincerely.

      Based on your arguments, it sounds like you believe that I’m suggesting that every value of a continuous predictor (X) in regression could be treated as a different grouping variable in ANOVA. And you are absolutely correct that that approach would lead to a big, ugly mess. If I’m misreading your concerns, please let me know.

      That’s not what I meant at all.

      The key phrases are “Use a model with a single categorical independent variable” and “In the regression, the categorical variable is dummy coded**.” The asterisks lead to a footnote at the bottom that shows that in the regression model, there are only two IVs, each of which has two values: 1 and 0. If you'd like to read up more on dummy coding, here is a good description: http://www.ats.ucla.edu/stat/mult_pkg/faq/general/dummy.htm.

      This works because even if there are two values, Clerical (1) and Non-Clerical (0), there is *more than one observation with each of those values.* You’re right, if the n=2, it wouldn’t work either. And I never did mention my sample size, but it was 474. This is a data set that one would typically run as a one-way anova. But if you dummy code, it works just as well in a linear regression and the values of the F test will be identical. Try it.

      • Hongyuan He says

        September 10, 2013 at 11:53 pm

        Hi Karen,

        Thank you for your response. What I believed you meant was that one can consider ANOVA and regression as the same concept, and still be fine. I don’t think that statement can be considered correct because they generally produce profoundly different results.

        The actual sample size is irrelevant. Nowhere in my original post did I mention anything about sample size needing to be 2. The fact that you had 2 *levels*, or groups (0 and 1) implies that your F-test results and group means will be identical between slope-intercept regression and ANOVA. We don’t even need to crunch the numbers to see why this is the case.

        Conversely, had there been 3 levels, the results of slope-intercept regression wouldn’t be the same as ANOVA at all. (But regression with a parabola, having 3 parameters, would still be identical to ANOVA, etc.)

        Hongyuan

        • Wander says

          December 12, 2014 at 2:05 am

          Hi Hongyuan,

          I read your and Karen's posts. However, I do not understand. What exactly do you mean by the results of slope-intercept regression? Do you mean regression with an intercept? If this is the case, they are exactly the same (assuming that residuals are normally, homogeneously, and independently distributed): not only the coefficients but everything else too, including the total sum of squares, the explained sum of squares, the residual sum of squares, etc. Researchers from fields that rarely deal with experimental designs do not even need to care about ANOVA. However, unfortunately, ANOVA is still taught because it's simply there, and frequently instructors fail to recognize that they are the same.

          • Hongyuan He says

            December 20, 2014 at 5:13 am

            Wander,

            Let’s say you had multiple groups (e.g. a school with 12 grades and 10 kids in each grade), and the data we were looking at was each kid’s grade on a test (let’s say they ALL took the same test).

            You’d have a few sources of variance here: (a) within-group (e.g. between the kids in each class), and between-group variation of the average score in each class; which can further be divided into (b) that which follows some functional model (say, maybe average scores linearly increase with age!), and (c) the residual from such regression.

            Now, if average scores did increase in a line (or parabola, sine function, or whatever model we chose in our linear regression), then there would be no (c) and the only sum-squares would come from (a) and (b).

            But in general, all three categories here are distinct. And while linear regression only distinguishes between (b) and “(a)+(c)”, ANOVA will give you the exact sum-squares contribution of each of (a), (b), and (c).

            Hongyuan

        • Simon says

          August 14, 2017 at 3:22 am

          I think Karen pretty much answered your inquiry. ANOVA and multiple regression are USUALLY overdetermined, because in most cases the number of parameters we're trying to estimate is smaller than the number of data points. That's why Karen mentioned that the sample size n was larger than 2. The whole point of the least-squares method is to solve an overdetermined regression, and ANOVA is pretty much using the exact same method. I just ran an ANOVA and a linear multiple regression of a variable with 3 categories, dummy coding 2 groups to allow the regression. The results yielded exactly the same statistics for the between- and within-group variances.

  35. Mike says

    May 24, 2013 at 9:01 pm

    Would it ever be the case that the significance tests of the regression coefficients would come out non-significant when the overall F-test did come out significant? What if, for example, you had a factor with three levels, A, B, and C, with means 3, 5, and 4. If C is the reference level, could it be the case in the regression model that neither the coefficient comparing A to C nor the coefficient comparing B to C would be significantly different from 0, but that the F-statistic would be significant due to the difference between A and B?

    • Karen says

      June 6, 2013 at 5:19 pm

      Yes. They're testing slightly different things, as you've noticed, and you've hit the difference exactly.

  36. ATJ says

    April 24, 2013 at 10:15 am

    Great post; thanks for sharing!

