Every once in a while, I work with a client who is stuck between a particular statistical rock and hard place.
It happens when they’re trying to run an analysis of covariance (ANCOVA) model because they have a categorical independent variable and a continuous covariate.
The problem arises when a coauthor, committee member, or reviewer insists that ANCOVA is inappropriate in this situation because one of the following ANCOVA assumptions is not met:
1. The independent variable and the covariate are independent of each other.
2. There is no interaction between the independent variable and the covariate.
If you look them up in any design of experiments textbook, which is usually where you’ll find information about ANOVA and ANCOVA, you will indeed find these assumptions. So the critic has nice references.
However, this is a case where it’s important to stop and think about whether the assumptions apply to your situation, and how dealing with the assumption will affect the analysis and the conclusions you can draw.
A very simple example of this might be a study that examines the difference in heights of kids who do and do not have a parasite. Since a large contributor to children’s height is age, this is an important control variable.
In this graph, you see the relationship between age (X1) on the x-axis and height on the y-axis at two different values of X2, parasite status. X2=0 indicates the group of children who have the parasite and X2=1 is the group of children who do not.
Younger children tend to be afflicted with the parasite more often. That is, the mean age (mean of X1) of the blue dots is clearly lower than the mean age of the black stars. In other words, the ages of kids with the parasite are lower than those without.
So the independence between the independent variable (parasite status) and the covariate (age) is clearly violated.
How to Deal with Violation of the Assumptions
These are your options:
1. Drop the covariate from the model so that you’re not violating the assumptions of ANCOVA and run a one-way ANOVA. This seems to be the popular option among most critics.
2. Retain both the covariate and the independent variable in the model anyway.
3. Categorize the covariate into low and high ages, then run a 2×2 ANOVA.
Option #3 is often advocated, but I hope you will soon see why it’s unnecessary, at best. Arbitrarily splitting a numerical variable into categories is just throwing away good information.
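To make the cost of option #3 concrete, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration: the sample size, the 2-year age gap between groups, the 6 cm-per-year growth rate, the true parasite effect of -3 cm, and the variable names. It also codes has_parasite as 1 for infected children, which is the reverse of X2 in the graph, and it fits the 2×2 design with main effects only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Hypothetical data: infected children are about 2 years younger on average.
has_parasite = rng.integers(0, 2, size=n)              # 1 = infected, 0 = not
age = rng.normal(8.0, 1.5, size=n) - 2.0 * has_parasite
true_effect = -3.0                                     # cm, at any given age
height = 80.0 + 6.0 * age + true_effect * has_parasite + rng.normal(0.0, 4.0, size=n)


def parasite_coefficient(covariate):
    """Least-squares coefficient on has_parasite, controlling for `covariate`."""
    X = np.column_stack([np.ones(n), covariate, has_parasite])
    coef, *_ = np.linalg.lstsq(X, height, rcond=None)
    return coef[2]


# Option #3: median-split age into low/high, then control for the split.
older = (age > np.median(age)).astype(float)
split_estimate = parasite_coefficient(older)

# Option #2: keep age as the continuous covariate it is.
continuous_estimate = parasite_coefficient(age)

print(f"true effect:          {true_effect:.2f}")
print(f"median-split age:     {split_estimate:.2f}")
print(f"continuous age:       {continuous_estimate:.2f}")
```

Under these assumptions, the median-split estimate should stay noticeably further from the true effect than the continuous-covariate estimate, because infected children are still younger than uninfected children within each age category.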
Let’s examine option #1.
The problem with it is shown in the graph–it doesn’t accurately reflect the data or the relationships among the variables.
With the covariate in the model, the difference in the mean height for kids with and without the parasite is estimated for children at the same age (the height of the red line).
If you drop the covariate, the difference in mean height is estimated with each group at its own mean age (the purple line).
In other words, any effect of age will be added to the effect of parasite status, and you’ll overstate the effect of the parasite on the mean difference in children’s heights.
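A small simulation can show the size of that overstatement. This is only a sketch under invented assumptions: the numbers (a 2-year age gap, 6 cm of growth per year, a true parasite effect of -3 cm) and the variable names are hypothetical, and has_parasite is coded 1 for infected children, the reverse of X2 in the graph.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Hypothetical data: infected children are about 2 years younger on average.
has_parasite = rng.integers(0, 2, size=n)              # 1 = infected, 0 = not
age = rng.normal(8.0, 1.5, size=n) - 2.0 * has_parasite
true_effect = -3.0                                     # cm, at any given age
height = 80.0 + 6.0 * age + true_effect * has_parasite + rng.normal(0.0, 4.0, size=n)

# Option #1 (one-way ANOVA): raw difference in group means, covariate dropped.
unadjusted = height[has_parasite == 1].mean() - height[has_parasite == 0].mean()

# Option #2 (ANCOVA): coefficient on the group dummy with age in the model.
X = np.column_stack([np.ones(n), age, has_parasite])
coef, *_ = np.linalg.lstsq(X, height, rcond=None)
adjusted = coef[2]

print(f"true effect at a given age: {true_effect:.2f}")
print(f"unadjusted difference:      {unadjusted:.2f}")
print(f"age-adjusted difference:    {adjusted:.2f}")
```

With these made-up values, the unadjusted difference absorbs the age gap (in expectation, -3 plus 6 × -2, or about -15 cm), while the age-adjusted estimate should sit close to the -3 cm built into the simulation.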
Why is it an assumption, then?
You are probably asking yourself “why on earth would this be an assumption of ANCOVA if removing the covariate leads us to overstate relationships?”
To understand why, we need to investigate the problem this assumption is addressing.
In the analysis of covariance section of Geoffrey Keppel’s excellent book, Design and Analysis: A Researcher’s Handbook, he states:
It [ANCOVA] is used to accomplish two important adjustments: (1) to refine estimates of experimental error and (2) to adjust treatment effects for any differences between the treatment groups that existed before the experimental treatments were administered. Because subjects were randomly assigned to the treatment conditions [emphasis mine], we would expect to find relatively small differences among the treatments on the covariate and considerably larger differences on the covariate among the subjects within the different treatment conditions. Thus the analysis of covariance is expected to achieve its greatest benefits by reducing the size of the error term [emphasis Keppel’s]; any correction for pre-existing differences produced by random assignment will be small by comparison.
A few pages later he states,
The main criterion for a covariate is a substantial linear correlation with the dependent variable, Y. In most cases, the scores on the covariate are obtained before the initiation of the experimental treatment…. Occasionally the scores are gathered after the experiment is completed. Such a procedure is defensible only when it is certain that the experimental treatment did not influence the covariate….The analysis of covariance is predicated on the assumption that the covariate is independent of the experimental treatments.
In other words, it’s about not tainting the conclusions that can be drawn from experimentally manipulated treatments. If a covariate were related to the treatment, it would indicate a problem with random assignment, or it would indicate that the treatments themselves caused the covariate values. These are very important considerations in experiments.
If, however, as in our parasite example, the main categorical independent variable is observed and not manipulated, the independence assumption between the covariate and the independent variable is irrelevant.
It’s a design assumption. It’s not a model assumption.
Whether the independent variable and the covariate are independent affects only how you interpret the results.
So what is the appropriate solution?
The appropriate response is #2–keep the covariate in the analysis, and don’t interpret results from an observational study as if they were from an experiment.
Doing so will lead to a more accurate estimate of the real relationship between the independent variable and the outcome. Just make sure you’re saying that this is the mean difference at any given value of the covariate.
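In equation form, using the X1 (age) and X2 (parasite status) labels from the graph, the claim looks like this. The symbols Y for height, the betas, and the error term are my notation, and the model assumes no age-by-parasite interaction (assumption #2 above):

```latex
\begin{aligned}
Y &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon \\
E[Y \mid X_1 = a,\ X_2 = 1] - E[Y \mid X_1 = a,\ X_2 = 0]
  &= (\beta_0 + \beta_1 a + \beta_2) - (\beta_0 + \beta_1 a) = \beta_2
\end{aligned}
```

The age term cancels, so the coefficient on parasite status is the mean height difference between the two groups at any fixed age a, not the difference between the two groups as they happen to be sampled.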
The last issue then becomes: If your critic has banned the word ANCOVA because you don’t have an experiment, what do you call it?
Now it’s down to semantics. It is accurate to call it a general linear model, a multiple regression, or (in my opinion) an ANCOVA. (I have never seen anyone balk at calling an analysis an ANOVA when the two categorical IVs were related.)
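If it helps to see that the names describe the same arithmetic, here is a short sketch on simulated data. The data-generating numbers and variable names are hypothetical. It computes the group difference two ways: as the dummy-variable coefficient from a multiple regression, and with the textbook ANCOVA adjusted-means formula based on the pooled within-group slope.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Hypothetical data, same structure as the parasite example.
group = rng.integers(0, 2, size=n)                     # categorical predictor
age = rng.normal(8.0, 1.5, size=n) - 2.0 * group       # covariate
height = 80.0 + 6.0 * age - 3.0 * group + rng.normal(0.0, 4.0, size=n)

# "Multiple regression": coefficient on the group dummy with age in the model.
X = np.column_stack([np.ones(n), age, group])
coef, *_ = np.linalg.lstsq(X, height, rcond=None)
regression_diff = coef[2]

# "ANCOVA": difference in adjusted means using the pooled within-group slope.
in_g1 = group == 1
age_centered = age - np.where(in_g1, age[in_g1].mean(), age[~in_g1].mean())
height_centered = height - np.where(in_g1, height[in_g1].mean(), height[~in_g1].mean())
pooled_slope = (age_centered @ height_centered) / (age_centered @ age_centered)

raw_diff = height[in_g1].mean() - height[~in_g1].mean()
age_gap = age[in_g1].mean() - age[~in_g1].mean()
ancova_diff = raw_diff - pooled_slope * age_gap

print(f"regression dummy coefficient: {regression_diff:.6f}")
print(f"ANCOVA adjusted difference:   {ancova_diff:.6f}")
```

The two numbers should agree up to floating-point error, which is the point: the label changes, the model does not.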
The critics who get hung up on this assumption are usually the ones who want a specific name. General Linear Model is too ambiguous for them. I’ve had clients who had to call it a multiple regression, even though the main independent variable was the categorical one.
One option is to use “categorical predictor variable” instead of “independent variable” when describing the variable in the ANCOVA. The latter implies manipulation; the former does not.
This is a case where it’s worth fighting for your analysis, but not the name. The point of all this is communicating results accurately.