Just recently, a client got some feedback from a committee member that the Analysis of Covariance (ANCOVA) model she ran did not meet all the assumptions.
Specifically, the assumption in question is that the covariate has to be uncorrelated with the independent variable.
This committee member is, in the strictest sense of how analysis of covariance is used, correct.
And yet, they over-applied that assumption to an inappropriate situation.
ANCOVA for Experimental Data
Analysis of Covariance was developed for experimental situations and some of the assumptions and definitions of ANCOVA apply only to those experimental situations.
The key situation is the independent variables are categorical and manipulated, not observed.
The covariate (continuous and observed) is considered a nuisance variable. There are no research questions about how this covariate itself affects or relates to the dependent variable.
The only hypothesis tests of interest are about the independent variables, controlling for the effects of the nuisance covariate.
A typical example is a study to compare the math scores of students who were enrolled in three different learning programs at the end of the school year.
The key independent variable here is the learning program. Students need to be randomly assigned to one of the three programs.
The only research question is about whether the math scores differed on average among the three programs. It is useful to control for a covariate like IQ scores, but we are not really interested in the relationship between IQ and math scores.
So in this example, in order to conclude that the learning program affected math scores, it is indeed important that IQ scores, the covariate, are unrelated to which learning program the students were assigned to.
You could not make that causal interpretation if it turns out that the IQ scores were generally higher in one learning program than the others.
So this assumption of ANCOVA is very important in this specific type of study in which we are trying to make a specific type of inference.
ANCOVA for Other Data
But that’s really just one application of a linear model with one categorical and one continuous predictor. The research question of interest doesn’t have to be about the causal effect of the categorical predictor, and the covariate doesn’t have to be a nuisance variable.
A regression model with one continuous and one dummy-coded variable is the same model (actually, you’d need two dummy variables to cover the three categories, but that’s another story).
The focus of that model may differ; perhaps the main research question is about the continuous predictor.
But it’s the same mathematical model.
The software will run it the same way. YOU may focus on different parts of the output or select different options, but it’s the same model.
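A quick sketch of this point, using simulated data loosely based on the learning-program example above (the group sizes, effect sizes, and variable names are all made up for illustration). An "ANCOVA-style" design matrix with one indicator column per group and a "regression-style" design matrix with an intercept and two dummy variables span exactly the same column space, so least squares produces identical fitted values:

```python
import numpy as np

# Hypothetical data: three learning programs, an IQ-like covariate,
# and a math-score outcome (all numbers are illustrative).
rng = np.random.default_rng(0)
n = 30
group = rng.integers(0, 3, n)            # which of three programs
iq = rng.normal(100, 15, n)              # continuous covariate
y = 50 + 2 * group + 0.3 * iq + rng.normal(0, 5, n)

# ANCOVA-style coding: one indicator column per group, plus the covariate
Xa = np.column_stack([group == 0, group == 1, group == 2, iq]).astype(float)

# Regression-style coding: intercept, two dummy variables, plus the covariate
Xr = np.column_stack([np.ones(n), group == 1, group == 2, iq]).astype(float)

fit_a = Xa @ np.linalg.lstsq(Xa, y, rcond=None)[0]
fit_r = Xr @ np.linalg.lstsq(Xr, y, rcond=None)[0]
print(np.allclose(fit_a, fit_r))  # True: identical fitted values
```

The coefficients are labeled differently (group means vs. an intercept plus differences from a reference group), but the model, the fitted values, and the residuals are the same.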
And that’s where the model names can get in the way of understanding the relationships among your variables. The model itself doesn’t care if the categorical variable was manipulated. It doesn’t care if the categorical independent variable and the continuous covariate are mildly correlated.
If those ANCOVA assumptions aren’t met, it does not change the analysis at all. It only affects how parameter estimates are interpreted and the kinds of conclusions you can draw.
In fact, those assumptions really aren’t about the model. They’re about the design. It’s the design that affects the conclusions. It doesn’t matter if a covariate is a nuisance variable or an interesting phenomenon to the model. That’s a design issue.
The General Linear Model
So what do you do instead of labeling models? Just call each one a General Linear Model. It’s hard to think of regression and ANOVA as the same model because the equations look so different. But it turns out they aren’t.
If you look at the two models, first you may notice some similarities.
- Both are modeling Y, an outcome.
- Both have a “fixed” portion on the right with some parameters to estimate; this portion estimates the mean values of Y at the different values of X.
- Both equations have a residual, which is the random part of the model. It is the variation in Y that is not affected by the Xs.
But wait a minute, Karen, are you nuts?–there are no Xs in the ANOVA model!
Actually, there are. They’re just implicit.
Since the Xs are categorical, they have only a few values, to indicate which category a case is in. Those j and k subscripts? They’re really just indicating the values of X.
(And for the record, I think a couple Xs are a lot easier to keep track of than all those subscripts. Ever have to calculate an ANOVA model by hand? Just sayin’.)
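One way to see that the implicit Xs really are there: the classical one-way ANOVA F statistic can be recovered from the R-squared of a regression on dummy variables. A minimal sketch with simulated data (three equal groups of 12; all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
groups = [rng.normal(m, 1.0, 12) for m in (0.0, 0.5, 1.0)]
y = np.concatenate(groups)
g = np.repeat([0, 1, 2], 12)
n, k = len(y), 2  # k = number of dummy variables (groups - 1)

# Classical one-way ANOVA F: between-group over within-group mean squares
grand = y.mean()
ss_between = sum(len(gr) * (gr.mean() - grand) ** 2 for gr in groups)
ss_within = sum(((gr - gr.mean()) ** 2).sum() for gr in groups)
F_anova = (ss_between / k) / (ss_within / (n - k - 1))

# The same F from a dummy-coded regression's R-squared
X = np.column_stack([np.ones(n), g == 1, g == 2]).astype(float)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r2 = 1 - (resid ** 2).sum() / ((y - grand) ** 2).sum()
F_reg = (r2 / k) / ((1 - r2) / (n - k - 1))

print(np.isclose(F_anova, F_reg))  # True
```

The regression's fitted values are exactly the group means, so its residual sum of squares equals the within-group sum of squares, and the two F statistics coincide term by term.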
So instead of trying to come up with the right label for a model, focus instead on understanding (and describing in your paper) the measurement scales of your variables, if and how much they’re related, and how that affects the conclusions.
In my client’s situation, it was not a problem that the continuous and the categorical variables were mildly correlated. The data were not experimental and she was not trying to draw causal conclusions about only the categorical predictor.
So all she had to do was relabel this ANCOVA model a multiple regression.
S Banerjee says
What is the role of the coefficient of determination? In our experiment, the post-test score is the DV, the group of students is subdivided into control and experimental units, and one set of test scores forms the covariate data. The covariate is not linearly related to the dependent variable. We are facing a situation where the regression line slopes are significantly different; one of the regression lines, corresponding to a treatment level, is parallel to the covariate axis, with a coefficient of determination close to zero. What should be our course of action here? Can we use the difference in slopes to some advantage?
Karen Grace-Martin says
Sure. You’ll want to use an interaction term to reflect the difference in slopes and it sounds like you may need something like a quadratic term to deal with the non-linear relationship between the dependent variable and the predictor.
(And I fixed your typo as per your request)
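A sketch of what that interaction term does, using made-up data rather than the questioner's (the group labels, slopes, and sample size here are all assumptions for illustration). Adding the product of the treatment indicator and the covariate lets the covariate's slope differ by group, which is exactly the "non-parallel regression lines" situation described:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
treat = rng.integers(0, 2, n)        # 0 = control, 1 = treatment
x = rng.normal(0, 1, n)              # covariate (e.g. a pre-test score)
# Simulated outcome: positive slope in control, flat slope under treatment
y = 1.0 + 0.8 * x * (treat == 0) + 0.5 * treat + rng.normal(0, 0.3, n)

# Interaction model: intercept, group effect, covariate, and their product.
# The product column lets the covariate's slope differ between groups.
X = np.column_stack([np.ones(n), treat, x, treat * x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
slope_control = b[2]          # slope of x when treat == 0
slope_treat = b[2] + b[3]     # slope of x when treat == 1

# Sanity check: the interaction model reproduces the slope you would
# get by fitting the treatment group on its own.
m = treat == 1
Xg = np.column_stack([np.ones(m.sum()), x[m]])
bg = np.linalg.lstsq(Xg, y[m], rcond=None)[0]
print(np.isclose(bg[1], slope_treat))  # True
```

A significant coefficient on the product term is the test of whether the slopes differ; whether a quadratic term is also needed depends on the shape of the nonlinearity in the actual data.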
In my analysis, ANOVA (or better: its post hoc tests) and regression differ in significance. I only have dummy variables for one treatment (for the regression I insert four of the five in the estimation). I get the exact same effect sizes, so the mean difference in the post hoc test equals the beta of the regression, BUT the coefficient is only significant in the regression, not in the post hoc test. Can you please help me figure out why?
Thanks and regards,
Hi Karen! Thanks for clarifying.
I have one more question: What (if any) would be the difference between running ANCOVA and a dummy coded forced entry LM? Can I in fact do it both ways?
I am currently using a dataset with 4 categorical variables and 3 continuous ones. I would like to do hypothesis testing on my dataset. My trusted statistician told me yesterday that I should be doing F-tests on that one (which would be ANCOVA, right?), so I am a little confused, as I thought LR would be fine.
Thank you very much!
Hi Cornelia, I’m not sure what you mean by forced entry LM. But yes, you should get the same results from running an ANCOVA or a linear regression.
That said, some software has different defaults for things like which interactions get included if you run it through an ANCOVA procedure vs. a linear model procedure. But if the model is specified the same, you will get identical results. Either can give you F tests, but again, often linear regression *procedures* don’t print them all out.
Hi Karen, would you please suggest a thesis topic for my masters that involves working in R and SPSS software?
It would help me find a job after completion of my masters here in Europe.
Amber Ward says
Love the title! So I now know I want to use ANCOVA. Just struggling to do a power analysis using GPower. Do you know how I should work out “df numerator” and “number of groups” (does this refer to each time a measurement will be made)?
If it helps – the study design is an intervention and I want to compare treatment group and control group test scores after treatment both at 6 and 12 months, controlling for baseline test scores.
I have estimated a GLM and included one 3-level categorical variable and several continuous variables — one of which is hypothesized to interact with the categorical variable. I asked SPSS for the regression parameters. The continuous variable is significant as a predictor in the ANOVA results table but is not a significant predictor in the Regression parameter results.
Ok so let’s say I knew there was a significant positive correlation between the number of kids a couple has and their happiness. Could I then say that the ANOVA would be significant even if I grouped the categorical variable as 0, 1-2, 3+, such that I may have lost information in my groupings?
No, not necessarily. It may or may not.
Just wondering – does the covariate have to be continuous?
I am testing for group differences and want to use a MANOVA approach due to the DVs being meaningfully related.
Can I control for gender? (One of the groups has significantly more females than the other)
Many people mean a continuous variable when they say “covariate,” but not everyone. Yes, you can control for a categorical variable.
Joan Hendrikz says
I really love the way you explain various stats concepts, using metaphors and analogies as well to aid the process of understanding. I have also been in a stats advisory role throughout my career and it is great to see people walk away happy and excited with a new understanding of something heretofore a mystery. I find metaphors and analogies a very powerful tool to this end. Cheers, Joan.
I agree. I call it “the click.” You can see when someone gets a concept that they were bewildered by. It’s especially rewarding when they believed they couldn’t learn it at all. 🙂
Tobias Musyoka says
Karen you are a great statistician, and a good teacher too. I just wish that you taught me in class.
Aw, shucks. Thanks, Tobias. I’m glad you find the site helpful.
I do have to humbly admit, though, that you’re finding this helpful in part because it’s a review and it’s now in the context of your own research. You’ll really learn it now, and it is helpful to have good support at this stage (and that’s why I’m here), but I couldn’t teach so well if you didn’t have the background.
I need a real example of regression, with an explanation of its output.
If you want a real example of regression output, in one of my very first webinars, I did just that. I literally went through the output of a model with both categorical and continuous predictors (and an interaction), and we went step-by-step through how to read the coefficients.
You can get a free download here: Interpreting Linear Regression Parameters: A Walk Through Output