In your statistics class, your professor made a big deal about unequal sample sizes in one-way Analysis of Variance (ANOVA) for two reasons.
1. Because she was making you calculate everything by hand. Sums of squares require a different formula if sample sizes are unequal, but SPSS (and other statistical software) will automatically use the right formula.
2. Nice properties in ANOVA such as the Grand Mean being the intercept in an effect-coded regression model don’t hold when data are unbalanced. Instead of the grand mean, you need to use a weighted mean. That’s not a big deal if you’re aware of it.
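To see the second point with numbers, here is a minimal sketch using made-up scores for three groups of unequal size (the group names and values are hypothetical). The grand mean is the same thing as a sample-size-weighted mean of the group means, while the unweighted mean of the group means, which is what an effect-coded intercept corresponds to, comes out different once the groups are unbalanced.

```python
from statistics import mean

# Hypothetical scores for three groups with unequal sample sizes.
groups = {
    "A": [4.0, 6.0],
    "B": [9.0, 10.0, 11.0],
    "C": [1.0, 2.0, 3.0, 4.0, 5.0],
}

group_means = {name: mean(values) for name, values in groups.items()}
n_total = sum(len(values) for values in groups.values())

# Grand mean: every observation counts once, so larger groups pull harder.
grand_mean = sum(sum(values) for values in groups.values()) / n_total

# The same number, written as a sample-size-weighted mean of the group means.
weighted_mean = sum(len(groups[g]) * group_means[g] for g in groups) / n_total

# Unweighted mean of the group means: each group counts once.
unweighted_mean = mean(list(group_means.values()))

print(grand_mean, weighted_mean, unweighted_mean)
```

With equal group sizes all three numbers would agree, which is why the distinction never comes up in balanced textbook examples.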
The only practical issue in one-way ANOVA is that very unequal sample sizes can affect the homogeneity of variance assumption. ANOVA is considered robust to moderate departures from this assumption, but the departure needs to stay smaller when the sample sizes are very different. According to Keppel (2003), there isn’t a good rule of thumb for the point at which unequal sample sizes make heterogeneity of variance a problem.
Real issues with unequal sample sizes do occur in factorial ANOVA, if the sample sizes are confounded in the two (or more) factors. For example, in a two-way ANOVA, let’s say that your two independent variables (factors) are age (young vs. old) and marital status (married vs. not). If there are twice as many young people as old and the young group has a much larger percentage of singles than the older group, the effect of marital status cannot be distinguished from the effect of age.
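A small sketch of that confounding, using invented cell sizes and cell means (none of these numbers come from real data). The outcome is set up so that only age matters, yet the raw comparison of married vs. single people shows a gap, simply because most singles are young.

```python
# Hypothetical cell sizes and cell means. Only age affects the outcome here:
# within each age group, married and single people have the same mean.
cells = {
    ("young", "single"):  {"n": 60, "mean": 10.0},
    ("young", "married"): {"n": 20, "mean": 10.0},
    ("old", "single"):    {"n": 10, "mean": 20.0},
    ("old", "married"):   {"n": 30, "mean": 20.0},
}

def weighted_marginal(status):
    """Mean over everyone with this marital status, ignoring age."""
    picked = [c for (age, s), c in cells.items() if s == status]
    return sum(c["n"] * c["mean"] for c in picked) / sum(c["n"] for c in picked)

def unweighted_marginal(status):
    """Average of the cell means, giving each age group equal weight."""
    picked = [c["mean"] for (age, s), c in cells.items() if s == status]
    return sum(picked) / len(picked)

raw_gap = weighted_marginal("married") - weighted_marginal("single")
adjusted_gap = unweighted_marginal("married") - unweighted_marginal("single")
print(f"raw gap: {raw_gap:.2f}, gap at equal age weights: {adjusted_gap:.2f}")
```

The raw gap is entirely an age effect in disguise. Real data will not be this clean, but the mechanism is the same.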
Power is based on the smallest sample size, so while it doesn’t hurt power to have more observations in the larger group, it doesn’t help either.
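One rough way to see why the smaller group dominates is to look at the standard error of the difference between two group means. This sketch assumes two groups with the same standard deviation (set to 1 for convenience) and a hypothetical smaller group of 20. Growing the larger group shrinks the standard error only slightly, and never below a floor set by the smaller group.

```python
from math import sqrt

def se_of_difference(n_small, n_large, sd=1.0):
    """Standard error of the difference between two group means,
    assuming both groups share the same standard deviation."""
    return sd * sqrt(1 / n_small + 1 / n_large)

n_small = 20
for n_large in (20, 40, 200, 2000):
    print(n_large, round(se_of_difference(n_small, n_large), 4))

# The standard error can never drop below this floor, however large
# the bigger group becomes.
floor = 1.0 / sqrt(n_small)
print("floor:", round(floor, 4))
```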
A very comprehensive article with more information about ANOVA in general and how sample sizes affect it is at http://www2.chass.ncsu.edu/garson/PA765/anova.htm.





This may be a silly question, but what if you are doing a 2x2x2 and you’re comparing males and females on their reaction times (2 tasks) and their anxiety (high or low)
and there are more females in the study than males.
Would this be a confound?
Hi Jessica,
It’s not a confound just if there are more females than males. It’s a confound only if, say, there are more females AND females are more likely to be anxious.
If your task and anxiety conditions are manipulated, so that you’re assigning people to them, then you have no problem. The example I gave could only occur if you also measured anxiety, not manipulated it.
Karen
I apologize in advance but I have bunch of questions about unequal sample sizes and one-way ANOVAs in a particular case study.
I am conducting an experiment with very different numbers of sample sites: control n=60; dose 1 n=114; dose 2 n=175. My main question is whether the responses to dose 1 and dose 2 are significantly different. Response was measured by the difference in the plant’s presence or absence before and after treatment. So if the plant was present at the sample site before treatment and absent at the same site after treatment, it was considered a 1 for response; if it was present before and present after, it was considered a 0; and if the plant was not present before and was present after, it was considered -1. (I already know from previous research that the two doses should be significantly different from the control, but I would like to do an ANOVA test to compare the responses in the control group and the two different doses.)
Q1: Is there a test, like Levene’s test, for determining the equality of variances for unequal sample sizes?
Q2: Should I not use ANOVA because the sample sizes are too different?
Q3: Would it be better or worse to conduct a series of t-tests?
Q4: If I choose to use ANOVA, should I use a Welch ANOVA followed by Games-Howell pairwise comparisons, as suggested in the pdf below, because the sample sizes are different? http://frank.mtsu.edu/~dkfuller/notes302/anova.pdf
Q5: Should I not use ANOVA or a t-test because I’m pretty sure the data are not Gaussian, given that the data are practically boolean? And if so, is there another test for comparing this kind of data?
Any help you could give me would be greatly appreciated. I feel pretty lost.
Thank you in advance,
Erika
Hi Erika,
Sorry it took me a while to respond. Hope this is still useful. You do have a lot of questions, but I’ll do my best.
1. Levene works with unequal sample sizes. Equal variance is even MORE important if sample sizes are unequal.
2. No. It’s fine to use ANOVA (assuming variances are equal) with unequal sample sizes. But you should NOT use ANOVA in this study because your response is categorical, not continuous.
3. Worse. Always worse.
4. Welch’s test could work in your design (if ANOVA were appropriate), but according to Keppel (1991), it’s “unsatisfactory” when you’re comparing more than 4 means.
5. Exactly. You could just run a Chi-square, or if you want to get really fancy, or you have covariates you want to include, a logistic regression.
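For readers who want to see what the Chi-square in point 5 actually computes, here is a minimal sketch. The counts are invented for illustration; only the row totals (60, 114, 175) match the sample sizes in the question. The function name `chi_square` is just a label for this example.

```python
# Hypothetical counts: rows are control, dose 1, dose 2 (row totals match the
# sample sizes in the question); columns are responses -1, 0, 1.
observed = [
    [10, 45, 5],
    [10, 60, 44],
    [10, 65, 100],
]

def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            # Count expected in this cell if group and response were unrelated.
            expected = row_totals[i] * col_totals[j] / total
            stat += (count - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

stat, dof = chi_square(observed)
print(f"chi-square = {stat:.2f} on {dof} degrees of freedom")
```

Turning the statistic into a p-value needs the chi-square distribution, which the standard library does not provide. If SciPy is available, `scipy.stats.chi2_contingency(observed)` should report the same statistic along with a p-value.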
Karen
Thank you Karen! You really helped to clear these things up for me. I really appreciate it. Sorry again for all of the questions.
Thanks Again,
Erika
Hi Karen,
I conducted a two-way ANOVA to test if there are differences in levels of teaching innovation (scores 0-6) between teachers based on school (1=regular school, 2=all-day school) and in-service training (1=none, 2=Basic ICT Skills, 3=Educational applications of ICT). I used unequal sample sizes (75 all-day teachers and 90 regular teachers).
The ANOVA table showed no significant main effects or interaction effect (all p > 0.05). However, the Model p-value was smaller than 0.05, showing that there are significant differences in the model.
I discussed only the interaction and main effects p-values. My chair told me to recheck the data analysis because it does not make sense for the Model to be significant when none of the effects (main or interaction) is significant. When I deleted the Model row from the table, claiming that the only important p-values to discuss were the main and interaction effects p-values, my chair said this was wrong.
The data analysis is correct–I double checked it. My question is: What does this Model p-value mean? Does it have to do with the unequal sample sizes? How should I discuss this Model p-value? Is it really this important to include it in my results?
Thanks in advance,
Adamantia
Hi Adamantia,
Thanks for being patient–I’ve been out of the office and just got back.
I can’t give you a definite answer of what is going on without trying it on data, but this is what is *probably* going on.
The Model p-value evaluates the overall effect of all IVs. IF all the IVs are completely independent and sample sizes are equal, the overall model effect won’t be significant if no IVs are.
IVs are usually only independent when you have randomly assigned subjects to conditions.
The other thing that can happen is if your p-values are close to .05, different tests might be falling on one side of that cutoff or the other. They’re not really changing much, and even just rounding can be creating differences. So if that’s the case, don’t take the .05 cutoff too seriously.
Karen
Hi there,
I have some data that gives the amount of time taken by three different surgeons to undertake a specific procedure. Given that I have a varying number of data points for each surgeon (e.g. 50/40/25) and that there may be unequal variance (e.g. slower surgeons having a greater variety of recorded times), what is the best way to figure out if there are significant differences in the time taken by each surgeon?
Cheers!
Hi Paulo,
I would start by seeing if the unequal variances are large enough to cause problems. If they are, with a one-way analysis like that, you could easily just run a nonparametric test.
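As a first look, something like the following would do: a sketch that lines up each group's size, mean, and variance, and reports the ratio of the largest to the smallest variance. The surgeon names and times are simulated purely for illustration (with the slower surgeons made more variable), and no cutoff is applied to the ratio, since there isn't an agreed rule of thumb.

```python
import random
from statistics import mean, variance

random.seed(1)

# Hypothetical procedure times in minutes for three surgeons (n = 50, 40, 25),
# simulated so that the slower surgeons are also more variable.
times = {
    "surgeon_1": [random.gauss(60, 8) for _ in range(50)],
    "surgeon_2": [random.gauss(70, 12) for _ in range(40)],
    "surgeon_3": [random.gauss(85, 20) for _ in range(25)],
}

summary = {
    name: {"n": len(x), "mean": mean(x), "variance": variance(x)}
    for name, x in times.items()
}
for name, s in summary.items():
    print(f"{name}: n={s['n']}, mean={s['mean']:.1f}, variance={s['variance']:.1f}")

# How many times larger the biggest group variance is than the smallest.
variances = [s["variance"] for s in summary.values()]
variance_ratio = max(variances) / min(variances)
print(f"largest / smallest variance: {variance_ratio:.1f}")
```

The thing to watch for is a large ratio combined with the smallest group having the largest variance, which is the pattern in this simulated example.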
Karen
good day! this is very urgent…. we have a report to pass tomorrow and our research design is two-way (2×2) anova factorial design. we dont know how to make results in spss. thank you!
hello again, we have unequal sample size.. thank you again!
Hi Vanna,
Hmm, we may be past your deadline anyway, but in any case, I’d need more information about what you need. The fact that you have unequal sample sizes in the ANOVA isn’t problematic. Just run it as you would any 2×2 ANOVA. If you need help running a 2×2 ANOVA in SPSS, I can tell you to use Univariate GLM. If you need more detail than that, I need a better idea of what you understand already and what you need help with.
Karen
Hi, in my test I am comparing a single variable among 3 groups with different sample sizes. Can I do a one-way ANOVA in spite of the unequal sample sizes?
Sure.
Hi Karen, I am interested in your 3rd response to Erika the 7th December 2010. You write Erika should not use ANOVA as the response is categorical, not continuous.
Do you mean that because of Erika’s design, a control: n=60; dose 1 n=114; and dose 2 n=175, it is inappropriate to use ANOVA here?
I am interested in this as I have similar conditions; a grouping variable with three categorical (depending on viewpoint) responses, a (very) unbalanced design, and for some dependents, unequal variances. Hence, I wonder, what analysis would be appropriate if I conclude my response is categorical, rather than a continuous?
Best Bud
Hi Bud,
Good question. No, the control/Dose 1/Dose 2 variable is her Independent Variable. It’s totally appropriate to have grouping (i.e., categorical) variables for the independent variable.
In Erika’s study, her Dependent Variable (aka Response Variable or Outcome) is ALSO categorical: Is the Presence of the plant the same after as it was before: Yes or No.
ANOVA compares means of the Dependent Variable across the different categories of the Independent Variable. Since there is no way to calculate a mean of Yes/No, you can’t use ANOVA.
So I’m not sure based on how you’ve described your study whether your dependent variable is indeed categorical. You mention unequal variances, which makes me think they really are numerical.
Here are a few posts that might be helpful:
When Dependent Variables Are Not Fit for GLM, Now What?
6 Types of Dependent Variables that will Never Meet the GLM Normality Assumption
Karen
Hi again Karen, and thanks!
I see your point, her (Erika’s) dependent was a categorical. Mine, however, are not. That I am sure of. However, a greater concern for me is that my sample sizes vary considerably: group 1 equals 464, group 2 = 444, and group 3 = 24.
My problem is that even though an ANOVA shows significant differences among the three groups on a specific dependent variable, and the largest calculated mean difference is between the smallest group and one of the others, post hoc tests cannot tell the smallest group apart from the group where the largest mean difference appears.
Standardized mean scores for the groups:
Group 1: -.08 (a)
Group 2: .27 (b)
Group 3: -.11 (ab)
Currently, I use Hochberg’s GT2 post hoc test, as it, I have read, is quite robust to violations of homogeneity of variance. I also, where indicated by the Levene’s test, modify the p-values using the Welch modification.
I know that this may be a lot to ask but I wonder whether you think I could benefit from bootstrapping or if such a procedure will not help me as the ratio among my three groups will not differ?
Best Viktor
Hi Karen,
I have run a two-way ANOVA (2 by 2 factorial design) and got a significant Levene’s test, p = .012. I have adjusted the critical alpha for interpreting the significance of both the main and interaction effects. However, I was wondering what practical methods can be used in future studies so that Levene’s is not violated, and whether you are able to give me some references.
Also, with another 2 by 2 factorial design that reveals a significant interaction, I am aware that follow-up simple effects tests are required. I ran these using the split data method in SPSS and recalculated the F statistic using the overall MSE. Is there a need to control for Type 1 error by using Bonferroni’s?
Thanks
Nicole
hi Nicole,
I never use Levene’s test. With large sample sizes, it’s almost always significant. With small sample sizes, it’s almost never significant.
So it’s not very helpful. Geoffrey Keppel’s book Design and Analysis of Experiments has a good section on this.
Or if you want a full explanation and demonstration about assumptions, what they mean and better ways to check them, I would actually recommend my workshop on assumptions in linear models. We have a home study version and you can get more information at: http://www.theanalysisinstitute.com/workshops/GLM-Assumptions/index.html
Karen
HI Karen,
I just want to know if i could actually use two way factorial anova for this.
I have two groups, Device 1 (n=27) and Device 2 (n=28). In each group, I have 5 sub categories of participants (very low, low, moderate, high and very high experience of playing games). For the Device 1 group I have 9, 8, 5, 2, 2 and 1 for each sub category. For Device 2, I have 7, 4, 7, 8, 2, 0 for each sub category. Can I use two way ANOVA for this? Or should I just provide descriptive analysis? The main objective of the experiment is to see if there is any difference in the participants’ total score when playing games on Device 1 or 2.
Hi Lulu,
You could run a two-way anova as is without the interaction on this. The problem subcategories are the ones with 1 and 0 people in them.
The other alternative, if the interaction seems necessary, is to collapse the experience variable into fewer categories.
I would suggest graphing the means to see if the interaction is important, and if not, leave it out. If it is, you’d be better off collapsing.
Karen
Hi KAREN.
I would like to know how the different letter superscripts are used in a post-hoc test. Can you suggest reading material with examples on when and how to use different letter superscripts when 7 treatments are considered and the level of significance varies in at least 4 of the paired means?
thanks!
jay
Hi Jay,
The way it works is that any means that are NOT significantly different in the post-hoc tests get the same letter superscript.
Let’s say the post-hoc results were simple, where M1 indicates the mean of group 1:
M3 < M1=M2=M4=M6 < M5=M7
They would be labelled this way in the table (sorry, I can’t get the superscripts in the comments, so just pretend the letters are up):
M1a
M2a
M3b
M4a
M5c
M6a
M7c
When it gets tricky is when there’s overlap, which is very common with 7 groups. So let’s say for example, we have this more complicated example:
M3 < M5=M7
M3 = M1=M2=M4=M6
M1=M2=M4=M6 = M5=M7
So the highest and lowest means are significantly different from each other, but the ones in the middle don’t differ significantly from anything. The means would be labelled like this:
M1a,b
M2a,b
M3a
M4a,b
M5b
M6a,b
M7b
So M3 is in a different group than M5 and M7. But M1, for example, shares a superscript with both M3 and M5 because it overlaps them.
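The lettering rule can be sketched in code. This is just one way to do it, not what any particular package does: the function name `letter_display` is made up, it takes the list of pairs that the post-hoc tests flagged as significantly different, and it tries every subset of groups, which is fine for 7 groups but would be slow for many more. The order in which letters are handed out is a convention chosen here so the labels are reproducible.

```python
from itertools import combinations

def letter_display(groups, significant_pairs):
    """Give each group one letter per homogeneous subset it belongs to."""
    significant = {frozenset(pair) for pair in significant_pairs}

    # Collect the largest subsets in which no two means differ significantly.
    subsets = []
    for size in range(len(groups), 0, -1):
        for candidate in combinations(groups, size):
            if any(frozenset(p) in significant for p in combinations(candidate, 2)):
                continue
            members = set(candidate)
            # Skip subsets already contained in a larger homogeneous subset.
            if not any(members <= kept for kept in subsets):
                subsets.append(members)

    # Order the subsets so the lettering is reproducible, then hand out letters.
    subsets.sort(key=sorted)
    letters = "abcdefghijklmnopqrstuvwxyz"
    return {
        g: ",".join(letters[i] for i, s in enumerate(subsets) if g in s)
        for g in groups
    }

groups = ["M1", "M2", "M3", "M4", "M5", "M6", "M7"]
# The overlapping example: only M3 differs from M5 and from M7.
labels = letter_display(groups, [("M3", "M5"), ("M3", "M7")])
for group in groups:
    print(group + labels[group])
```

Feeding in the pairs from the simpler first example gives the three non-overlapping letters instead.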
Hope that helps!
Karen
Hi Karen, I was just wondering if you have any suggestion as to how to further interpret findings if the variance is unequal (Levene is highly significant, groups are large >300) when conducting an ANCOVA in SPSS. There seems to be no way to obtain Welch or Hochberg when a covariate is included (age)…Do you have any suggestions?
Kind regards Sindre
Hi Sindre,
You could always run a weighted least squares model. You can put the weight in either GLM or Regression procedures.
Karen
Hi, Karen,
I noticed that you told another commenter you never use Levene’s test. So I wondered how you test the homogeneity of variance as a stat consultant, since Levene’s test is known to be affected by the sample size.
Another question: I’m working on a project involving one-way ANOVA. Basically, we want to compare students’ outcomes under seven different instruction methods. Since we have unequal sample sizes, we chose to test the homogeneity of variance first. If the assumption is met, we go with the normal ANOVA F and Tukey as post hoc. If the assumption is not met, we go with Welch F plus Games-Howell as post hoc. Is this approach correct?
Any thoughts are greatly appreciated! Thank you.
Ming
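For readers following the Welch route mentioned in this question, here is a minimal sketch of the Welch F statistic using only the standard library. The function name `welch_anova` and the three small samples are invented for illustration. It returns the statistic and its two degrees of freedom, but not a p-value.

```python
from statistics import mean, variance

def welch_anova(samples):
    """Welch's F statistic and its degrees of freedom for k independent groups."""
    k = len(samples)
    n = [len(x) for x in samples]
    means = [mean(x) for x in samples]
    # Each group is weighted by n / s^2, so noisy or small groups count less.
    w = [n_i / variance(x) for n_i, x in zip(n, samples)]
    w_total = sum(w)
    weighted_mean = sum(w_i * m for w_i, m in zip(w, means)) / w_total

    between = sum(w_i * (m - weighted_mean) ** 2 for w_i, m in zip(w, means)) / (k - 1)
    lam = sum((1 - w_i / w_total) ** 2 / (n_i - 1) for w_i, n_i in zip(w, n))
    f_stat = between / (1 + 2 * (k - 2) * lam / (k ** 2 - 1))
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)
    return f_stat, df1, df2

# Hypothetical scores under three instruction methods, unequal n and spread.
method_a = [12.0, 14.0, 15.0, 13.0, 16.0, 14.0]
method_b = [18.0, 25.0, 11.0, 22.0]
method_c = [15.0, 16.0, 15.5, 14.5, 16.5, 15.0, 16.0, 15.5]
f_stat, df1, df2 = welch_anova([method_a, method_b, method_c])
print(f"Welch F = {f_stat:.2f}, df1 = {df1}, df2 = {df2:.1f}")
```

With exactly two groups this reduces to the square of the unequal-variance t statistic, which is a handy sanity check. A p-value needs the F distribution; with SciPy installed, `scipy.stats.f.sf(f_stat, df1, df2)` is one way to get it.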
Hello Karen,
I wonder if you can help:
I’ve conducted 8 2x2x2x2-way between-subjects ANOVAs. The sample size is 179. There are approximately similar numbers of participants in each level of the independent variables and across the 16 combinations of the IVs.
I have 8 dependent variables. For some of the ANOVAs, the Levene’s Test is significant, for others it is not. When it is significant, I have used all expected ways of transforming the DV, without success.
Because the Levene’s test is sometimes not significant on the same sample, does this mean that the comment you made above about Levene’s sometimes being significant in large samples does not apply to my data? (Would 179 participants be considered a large sample?)
There is no alternative non-parametric test for a 4-way ANOVA, so I’m unsure what to do. Any advice you could offer would be greatly appreciated.
Kind regards
Fay
Hi Fay,
I would suggest checking for non-constant variance in ways other than Levene’s. I don’t think you have an issue of a sample being too big–you’ve got only slightly more than 10 per condition.
And transformations are really only useful for non-constant variance if you also have non-normality. Otherwise the normality will be messed up. (that’s a technical term).
I have a whole workshop on this, which you might want to look into: Assumptions of Linear Models
Best,
Karen
Hey Karen, thank you for posting this article and for taking the time to respond to so many of your readers’ questions. I’m really impressed with that and will be checking out more of your website. Cheers
Thanks, Remy, for the kind words.
Karen
Hi Karen,
I have a data set with one control and two treatments. Basically, three groups of cows were fed a control diet, a contaminated diet, and a contaminated diet with additive. I have the following sample sizes: control = 2 cows, trt1 = 5 cows, and trt2 = 5 cows. These are pretty small numbers to begin with, but we were limited by money (yay research!). I’m using Proc Mixed in SAS for the ANOVA, but after reading some of the comments above, I’m not sure I’ve done this correctly. Can you offer some advice on the proper way to analyze this data?
Thanks
Stephanie
Hi Stephanie,
Proc Mixed might be fine–you haven’t given me enough information. Is there a reason for mixed, like repeated measures on each cow or randomized blocks?
Karen