Covariate is a tricky term in a different way than hierarchical or beta, which have completely different meanings in different contexts.

Covariate really has only one meaning, but it gets tricky because the meaning has different implications in different situations, and people use it in slightly different ways. And these different ways of using the term have BIG implications for what your model means.

The most precise definition is its use in Analysis of Covariance, a type of General Linear Model in which the independent variables of interest are categorical, but you also need to control for an observed, continuous variable–the covariate.

In this context, the covariate is always continuous, always a control variable, and always observed (i.e. observations weren’t randomly assigned it’s values, you just measured what was there).

A simple example is a study looking at the effect of a training program on math ability. The independent variable is the training condition–whether participants received the math training or some irrelevant training. The dependent variable is their math score after receiving the training.

But even within each training group, there is going to be a lot of variation in people’s math ability. If you don’t control for that, it is just unexplained variation. Having a lot of unexplained variation makes it pretty tough to see the actual effect of the training–it gets lost in all the noise.

So if you use pretest math score as a covariate, you can control for where people started out. So you get a clearer picture of whether people do well on the final test due to the training or due to the math ability they had coming in.

Okay, great. Where’s the confusion?

### Covariates as Continuous Predictor Variables

The confusion is that, really, the model doesn’t care that the covariate is something you don’t have a hypothesis about. Something you’re just controlling for. Mathematically, it’s the same model, and you run it the same way.

And so people who understand this often use the term covariate to mean ANY continuous predictor variable in your model, whether it’s just a control variable or the most important predictor in your hypothesis. And I’m guilty as charged. It’s a lot easier to say covariate than continuous predictor variable.

But SPSS does this too. You can run a linear regression model with only continuous predictor variables in SPSS GLM by putting them in the Covariate box. All the Covariate box does is define the predictor variable as continuous.

(SAS’s PROC GLM does the same thing, but it doesn’t specifically label them as Covariates. In PROC GLM, the assumption is all predictor variables are continuous. If they’re categorical, it’s up to you, the user, to specify them as such in the CLASS statement.)

### Covariates as Control Variables

But the other part of the original ANCOVA definition is that a covariate is a control variable.

So sometimes people use the term Covariate to mean any control variable. Because really, you can covary out the effects of a categorical control variable just as easily.

In our little math training example, you may be unable to pretest the participants. Maybe you can only get them for one session. But it’s quick and easy to ask them, even after the test, “Did you take Calculus?” It’s not as good of a control variable as a pretest score, but you can at least get at their previous math training.

You’d use it in the model in the exact same way you would the pretest score. You’d just have to define it as categorical.

Once again, there isn’t really a good term for categorical control variable, so people sometimes refer to it as the covariate.

### So what is a covariate then?

It’s hard to say. There is no disputing the first definition, so it’s clear there.

I prefer to just be careful, in setting up hypotheses, running analyses, and in writing up results, to be clear about which variables I’m hypothesizing about and which ones I’m controlling for, and whether each variable is continuous or categorical.

The names that the variables, or the models, end up having, aren’t important as long as you’re clear about what you’re testing and how.

———————————————————————————-

Read other posts on Confusing Statistical Terms and/or check out our workshops and other resources related to ANOVA and ANCOVA.

{ 48 comments… read them below or add one }

One other issue with covariates arises when actually using the results in a subsequent decision process. Variables controlled for are akin to what is lost when partially differentiating. It is important to understand the need to put this information back when making decisions. By this, if we take the math training example you used above, within the population studied, there may be subsets who do well with a particular method and those who don’t. If you simply control for this variation with pre-test scores, you effectively average the variation away in you analysis. When this is done, it is critically important to understand that you are seeing an averaged picture (and that many potentially important pieces of information are absent). I loved how Good and Hardin saw fit at the beginning of the first chapter of their Common Errors in Statistics, to state unequivocably that statistics should never be used as the sole basis for making decisions. Thus ‘missing or discarded information’ is in part, to me, why. (Essentially you need either strong uniformity conditions on a population, or explicit inspection, before you can be confident a statistical result applies to a particular instance.)

Hi Karen,

I have 45 participants who received an exercise intervention. I am observing the change in their physical performance pre-post and after 8 months of the intervention. I have found group differences by using RM ANOVA. Now I need to control some covariates/ confounding variables (categorical variables) like age group, marital status, education level etc. How shall I do it? I am using SPSS version 22. Please suggest the most suitable tool to use. Please also comment if I could get that tool in the following options:

Option 1: While defining the independent variable in the process of RM ANOVA, there are two more spaces available to put data on ‘between subject factor’ and ‘covariate’. Shall I put all 8 covariates under the space labelled ‘covariates’? If I can do that,

Q1. Will that actually control these variables all together?

Q2. Will it lose power in doing so?

Q3. How shall I then term the tool of analysis? RM ANOVA with covariates or RM ANOVA or any other term?

Q4. Shall I use them as time*cov for all the covariate separately?

Option 2: Shall I use mixed method ANOVA by putting one particular covariate in the ‘between subject factor’ and see if it has any effect?

Q.1 Shall I look at the between subject effect in the output file to see the impact? Or need to look at the effect size or both?

Q2. Shall I need to put other 7 covariates under the space labelled ‘covariates’ while considering one covariate as a between subject factor?

Q3. How shall I then term the tool of analysis?

Looking forward to hearing from you.

Many thanks in advance.

Hi, Karen,

I really like your explanation about the term. Enjoyed reading it. Thanks a lot. It certainly helps to stop the arguments between my students and me.

Best,

Claudia

Hi Karen,

Thanks for the helpful article. I’m trying to decide which variables to include as control variables in my regression model. Aside from theoretical reasons, I have examined my correlation matrix. I have a variable that correlates with my IV, but not with my DV…Am I correct in assuming that this variable should NOT be controlled for in the model (since it is unrelated to the DV)?

Thanks for your help!

Karen,

Thanks so much for this clarification.

So, in other words, if I want to control for a categorical variable, I still run ANOVA. and if I want to control for a continous variable, I run ANCOVA. *phew*

Yes, exactly.

Hi Karen, thank a lot for the site. Its really great.

Currently, I’m designing an experiment on stress treatment of plants with single (heat or drought) or combined (heat+drought) fixed effect IVs to measure one or more response DVs. Each stress treatment will be applied with several levels at (combined with) 3 time points. Measurement(s) are from different experimental units and not from the same individual plant. I’m thinking using either 2-way ANOVA or MANOVA depends on the no. of DV to be measured. However, I’m a little bit confused about the time point factor. I would expect that the measurements from control plants will not be significantly changed by time (as there is no stress), but under (e.g. heat treatment) I would expect to find a significant change in the main effect factor between the 3 time points. Shall I consider the time point as a covariance factor in this case, and use, for example, one-way ANOVA instead of two-way to measure one DV under one heat. Do you recommend any specific analysis or model different from above.

I would appreciate a lot your advice and help

Hi Karen,

I need some help with choosing the statistical analysis.

Details of my study.

all participants will complete surveys measuring self compassion, whether their psychological needs are met, how high on non attachment scale they are and other trait scales.

then, Half will be given some induction training on how to meditate. half will not.

after that, everyone will be tested to see how mindful and meditative they were using some breath counting measures.

i want to see relationship between the outcome and the survey responses and the training received.

for eg. someone whose needs are met and who received training did well on being mindful.

what about someone whose needs are met but was part of the group that did not receive training, what if he does well or what if he doesn’t.

is the survey responses the covariates? How would i write up the research question?

would the test be correlation or regression? ANCOVA?

My original hypothesis was needs met led to being good at meditation. But now the control group has been included, I am confused.

Please help.

Thanks in advance

Just wanted to say thank you for the site and the easy to follow text. Great job!

Hi Karen!

First of all, thank you for your site because it has helped me a lot of times 😉

I have a doubt in the statistical analysis of my study. I am studying stress on mothers and fathers (two independent samples). An important variable in my study is the number of children (usually mothers or fathers with more children feel more stress) and my subjects have between 1, 2 or 3 children. However because I am only looking for gender (mothers vs fathers) differences in stress I want to control this variable (number of children) in my sample. Thus, I decided to apply a chi-square analysis to see if my sample of mothers differed from my sample of fathers in the number of children. The test is non significant so i assumed that my two independent samples do not differ in number of children and that number of children is not a covariate when comparing the mothers and fathers of my sample. Is this correct?

Thanks a lot*

Very helpful. Thanks

Hi Karen,

Maybe you could clarify something for me. I have a model that consist of 3 independent variables and on dependent variable. However, in my research I identify two other concepts that acts as mediators (social exchange and perceived organizational support). Would these concepts be considered as covariates?

Thanks

Hi Lee,

Mediators are a little different than covariates. See this: http://www.theanalysisfactor.com/five-common-relationships-among-three-variables-in-a-statistical-model/

Hi Karen,

I found your review of covariates really helpful. My research involves exploring the impact of anxiety on communication. I’m using a transmission chain methodology where one person reads a story and reproduces that for the next person in the chain, who reproduces it for the next person in the chain and so on until 4 people have read and reproduced the story. I used a mood induction procedure, which is my between subjects factor. Typically these studies classify the ‘generations’ in the transmission chain as a within subjects factor as the output of person 2-4 is dependent on what they received from the previous person (as if you are taken measurements at different points in time). I am measuring the number of positive and negative statements produced by each person in their reproductions. So what I have done is a 2x4x2 ANOVA.

What I also want to do is explore the impact of trait anxiety, which I measured prior to testing and is a continuous variable. Is it possible to enter trait anxiety as a covariate in an ANOVA/ANCOVA to determine if trait anxiety was related to/or differentially impacted performance under each condition? The hypothesis is that high trait anxiety participants, under a negative mood induction would show a different pattern of results to low trait anxious.

I can perform a median split and enter it as a between subjects factor but this will result in a small number of observations, particularly once I try to look at more than one generation and I’m worried about the loss of power.

I would really appreciate your advice.

All the best,

Keely

I’m having the same problem 🙁

Dear Karen,

i was wondering if a covariate is the same as the mediator in a repeated measurement model? I am using it that way.

But maybe there is another way to test a mediation?

Thanks in advance for your help

Hi Karen,

What does the phrase “covary out the effects” mean? Your article has been very helpful in helping to clear up confusion with terms!! But I’m still at a loss as to why this phrase was used in the context of a test evaluation. In discussing potential changes to a control group (outside the bounds of the test), we were told they could “covary out the effects” of a change.

Hi Liz,

Great question, and one that will take another article to describe. 🙂 I’ll add that to my list of future articles to write.

Hi,

I am new to this site and this has always confused me. How do you “control” for a variable? Could you please explain, how “controlling “works? One of the ways to control a variable might be just taking random samples, i.e., if you want to control for age, then take a rs from all age groups. Another example would be like a “control group” (placebo in a drug experiment).

Am I correct in the above examples? Also, I assume there are other standard techniques, could you please clarify how they work?

Hi RK,

That’s a big question. Or rather, a small question with a big answer. I will see if I can write a post (or two or 10) explaining it.

Thanks Karen! Looking forward to it!!

Hi Karen,

Thanks for this article! I just want to further clarify some of the discussion on dichotomous covariates/fixed factors. Specifically, I am running a chi squared with one dichotomous outcome and one dichotomous predictor- too see how well group membership of the two-level predictor discriminates groups membership of the two level outcome. Now, I want to ensure that gender (I think my covariate/fixed factor) does not modify the relationship of my predictor’s ability to discriminate membership of my outcome variable. To answer this question I plan to run the model where both genders are combined and separately for each gender. The answer is that discriminate ability of my predictor does depend on gender to some extent.

So my question is – is gender a fixed factor here? is there a better word for it? Random factor?

Thank you so much!

Alyssa

Hi Alyssa,

Within the context of SPSS GLM, Gender is a fixed factor. Don’t make it random–that’s a whole other thing!

But if you’re doing a chi-square, Fixed Factor and covariate aren’t really issues. Just add it in as another variable

Easy to understand definition of Covariate for ANCOVA. Thank you!

Hi,

I’m running an ANOVA with repeated measures. As I can see from the correllation matrix there are significant correllations between my possible cofounding variables and my dependent variable but here is my problem: if the possible co-variate is correllating with the dep. variable at time2 but not at time1, do I have to include it as a “normal” co-variate when I perform the GLM?

Thanks for any help.

Anika

It would be a good idea to include a covariate*time interaction. That will allow the effect of the covariate to be different at time 1 and time 2.

Hi Karen,

I have a follow-up question please. I am looking at memory performance in young and older adults under two conditions: when a negative or a positive stereotype is activated. I’m using a mixed model to compare performance across time (before/after the intervention) across the two age groups and conditions.

I obtain a significant difference between the two age groups on levels of verbal IQ (NART scores) which is continuous, and hence a covariate (that exerts a significant effect).

The paper that I am basing my study on also obtains a difference between age groups over verbal IQ scores. They do not obtain a significant difference within age groups but between conditions, however, and so have not included it as a covariate. This seems wrong to me. Surely if a significant difference over a background variable occurs on one of your IVs you should include the covariate in the model, regardless of whether there’s a difference between groups on the second IV?

If you could clear this up for me I’d really appreciate it, as I am confused!

Thanks,

Joanne

Hi Joanne,

I am missing something. Does verbal IQ relate to the DV? (It seems it would but you don’t mention that). So in this paper, the two stereotype condition groups have different verbal IQ scores, but age groups didn’t? It really comes down to whether the potential covariate is related to the DV.

Hi Karen,

Thanks for the information. The way you write it very clear and easy to understand. Thanks~

Thanks!

So can we say based on your answer that every confounding variable is a covariate but not every covariate is a confounding variable?

Zahir

Depends on how you’re using them. 🙂

What is the difference between a confound and a covariate in simple terms?

Alexa, that’s a really great question. A confound is a variable that is perfectly (or so near perfectly that you can’t distinguish) associated with another variable that you can’t tell their effects apart.

In most areas of the US, for example, neighborhood and school attended overlap so much because most kids from a neighborhood all go to the same school. So you couldn’t separate out the school effects on say, grade 3 test scores, from the neighborhood effects.

A covariate is a variable that affects the DV in addition to the IV. It doesn’t have to be correlated with the independent variable. If not, it may just explain some of the otherwise unexplained variation in the DV.

Karen

Thanks Karen! I’m so happy this website exists.

I found this page because I am stuck on something related but at a way lower level (I am no statistician). I’m running a repeated measures ANOVA in SPSS, using GLM. I need to control for a between-subjects categorical variable that might be adding noise to the data and washing out any effects of my factors. I can’t figure out if the right way to do this is to put it in as a between-subjects factor, or pretend that is a continuous variable and put it in the covariate box. Can you help?

Hi Emily,

Yes. In fact, there is already an article here on that exact topic. It’s the same in all SPSS glm procedures, whether you’re using univarate, repeated measures, etc. http://www.theanalysisfactor.com/spss-glm-choosing-fixed-factors-and-covariates/

Karen

Hi Karen

I’m still a little confused on the same issue as Emily, despite reading the article you suggested. Although the suggested article seems to clearly spell out that any true categorical variable, including dichotomous variables, should be included in an ANOVA model as a fixed factor rather than a covariate, the article above states

“In our little math training example, you may be unable to pretest the participants. Maybe you can only get them for one session. But it’s quick and easy to ask them, even after the test, “Did you take Calculus?” It’s not as good of a control variable as a pretest score, but you can at least get at their previous math training…You’d use it in the model in the exact same way you would the pretest score. You’d just have to define it as categorical.”

It’s the last sentence that I get stuck on – because if you included a dichotomous variable in the model in the exact same way you would a continuous pre-test score, you would include it as a covariate, not as a fixed factor.

Sorry if this seems obvious, perhaps I’m getting caught up in the terminology too! Any clarification would be greatly appreciated, I’ve found your explanations to be more helpful than most!

Hi Debra,

SPSS’s definitition of “Covariate” is “continuous predictor variable.” It’s definition of “fixed factor” is categorical predictor variable. (There’s actually more to this in comparing fixed and random factors, but that’s a tangent here).

“It’s the last sentence that I get stuck on – because if you included a dichotomous variable in the model in the exact same way you would a continuous pre-test score, you would include it as a covariate, not as a fixed factor. ”

I don’t mean

defineit the same way, I meanuse it as a control variablein the same way. I’m trying to separate out the use of the variable in the model (as something to control for vs. something about whose effect you have a hypothesis) from the way it was measured and therefore needs to be defined (categorical vs. continuous).So yes, it goes into Fixed Factors because it’s categorical.

And don’t apologize for getting confused with terminology–that’s my whole point. The inprecision of the terminology is what makes it so confusing! 🙂

Karen

Hello Karen

I was going through the discussion and had same confusion. my study evaluates effectiveness of a school based program on preschoolers behavior problems. my intervention and control groups differ on strength of students in class measured in categories and fathers education also measured in categories. I need to see if ANCOVA results remain significant after controlling for these baseline differences. the covariate option in SPSS should be a continues measure. So how should I do the analysis with categorical variable. Please answer soon.

Pls what is the relationship between covariate partial eta squared in ancova result and partial eta squared of treatment and moderator variables. what is the implication of that of covariate which can be the pretest being higher than that of main treatment or moderator variable

Hi Fasasi,

I don’t know–I’d need more information on the model. For example, is the covariate different from the moderator? Which interactions are being included?

Thanks,

Karen

Thanks a lot Caren. Your notes were very helpful. I have been looking for the answers in tens of books for several months. Thank God, many of my uncertainties on GLM command are answered today in your site. FYI, in my area of study (accountancy), GLM command is almost nonexistent in literatures.

Hi Nur,

I’m so glad.

Most of the time in the literature it will be called ANOVA, ANCOVA or linear regression. But they’re all the same model and all can be run in GLM.

Karen

Very clear. Found the answer I was looking for. Thanks much! I’ll pass on word of your site.

Thanks, Ben! Glad it was helpful.

Karen

Hi Akinboboye,

I would suggest starting with the SPSS category link at the right.

If you need more help at the beginning level, I’d be happy to send you my book (I have a few extra copies) for the cost of shipping. Please email me directly.

If you want more help with the concepts discussed above, you really want the Running Regressions and ANCOVAs in SPSS GLM workshop. It walks you through the univariate GLM procedure step-by-step and shows where it’s the same and where it’s different from the regression procedure. You can get to that here: http://www.theanalysisinstitute.com/workshops/SPSS-GLM/index.html

It’s not running right now, but you can use our contact form to get access to it as a home study workshop.

Best,

Karen

I really enjoy this write-up. Kindly send me details on how to use spss. Thanks

Hi Ayesha,

If a control variable is simply a categorical variable, put it into “Fixed Factors” instead of Covariate in Univariate GLM. By default, SPSS will also add in an interaction term, but you can take that out in the Design dialog box.

fyi, if it’s helpful, we have a workshop available on demand that goes through all these details of SPSS GLM: http://theanalysisinstitute.com/spss-glm-ondemand-workshop/

{ 3 trackbacks }