Updated Nov 4, 2020 to add more detail
In your statistics class, your professor made a big deal about unequal sample sizes in one-way Analysis of Variance (ANOVA) for two reasons.
1. Because she was making you calculate everything by hand. Sums of squares require a different formula* if sample sizes are unequal, but statistical software will automatically use the right formula. So we’re not too concerned. We’re definitely using software.
2. Nice properties in ANOVA such as the Grand Mean being the intercept in an effect-coded regression model don’t hold when data are unbalanced. Instead of the grand mean, you need to use a weighted mean. That’s not a big deal if you’re aware of it.
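To see the difference with a quick numeric sketch (made-up numbers): with unequal group sizes, the grand mean of all observations — which is a weighted mean of the group means — no longer equals the unweighted average of the group means.

```python
import numpy as np

g1 = np.array([1.0, 2.0, 2.0, 3.0])  # n = 4, group mean = 2
g2 = np.array([4.0, 6.0])            # n = 2, group mean = 5

# Grand mean of all observations = group means weighted by sample size
grand_mean = np.concatenate([g1, g2]).mean()   # (4*2 + 2*5) / 6 = 3.0

# Unweighted average of the group means
unweighted = np.mean([g1.mean(), g2.mean()])   # (2 + 5) / 2 = 3.5
```

With balanced groups these two are identical; unbalanced, they diverge, so you need to know which one your software's intercept actually represents.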
But there are a few real issues with unequal sample sizes in ANOVA. They don’t invalidate an analysis, but it’s important to be aware of them as you’re interpreting your output.
Two Practical Issues for Unequal Sample Sizes in One-Way ANOVA
1. Assumption Robustness with Unequal Samples
The main practical issue in one-way ANOVA is that unequal sample sizes affect the robustness of the equal variance assumption.
ANOVA is considered robust to moderate departures from this assumption. But that’s not true when the sample sizes are very different. According to Keppel (1993), there is no good rule of thumb for how unequal the sample sizes need to be for heterogeneity of variance to be a problem.
So if you have equal variances in your groups and unequal sample sizes, no problem. If you have unequal variances and equal sample sizes, no problem.
The only problem is if you have unequal variances and unequal sample sizes.
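If you'd rather check the assumption than guess, one common option is Levene's test, available in scipy. Here's a sketch with simulated data (the group sizes and spreads are invented to mimic the problem case):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
small_spread = rng.normal(loc=0, scale=3, size=15)   # small group, large variance
large_tight  = rng.normal(loc=0, scale=1, size=150)  # large group, small variance

# Levene's test: null hypothesis is that the group variances are equal
stat, p = stats.levene(small_spread, large_tight)
```

A small p-value here, combined with very different sample sizes, is the one combination worth worrying about.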
2. Power with Unequal samples
The statistical power of a hypothesis test that compares groups is highest when groups have equal sample sizes.
Power is limited by the smallest sample size. Adding observations to the larger group never hurts, but the gains diminish quickly and are capped by the size of the smaller group.
So if you have a specific number of individuals to randomly assign to groups, you’ll have the most power if you assign them equally.
If your grouping is a natural one, you’re not making decisions based on a total number of individuals. It’s very common to just happen to get a larger sample of one group compared to the others.
That doesn’t bias your test or give you incorrect results. It just means the power you have is based on the smaller sample.
So if you have 30 individuals with Treatment A and 40 individuals with Treatment B and 300 controls, that's fine. It just means the power of any comparison involving Treatment A is capped by those 30 individuals; once the control group is several times that size, additional controls add almost nothing.
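To put rough numbers on that, here's a normal-approximation sketch of two-sample t-test power as a function of the two group sizes (the effect size of 0.5 and the sample sizes are just illustrative):

```python
import numpy as np
from scipy.stats import norm

def approx_power(d, n1, n2, alpha=0.05):
    """Normal approximation to the power of a two-sided two-sample t-test."""
    ncp = d * np.sqrt(n1 * n2 / (n1 + n2))  # noncentrality parameter
    z = norm.ppf(1 - alpha / 2)
    return 1 - norm.cdf(z - ncp)

p_30_30   = approx_power(0.5, 30, 30)
p_30_300  = approx_power(0.5, 30, 300)
p_30_huge = approx_power(0.5, 30, 10**6)  # controls effectively infinite
```

With n1 fixed at 30, power has a hard ceiling no matter how many controls you add: `p_30_300` and `p_30_huge` are nearly identical.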
Yes, this all holds true for independent samples t-tests
Independent samples t-tests are essentially a simplification of a one-way ANOVA for only two groups. In fact, if you run your t-test as an ANOVA, you’ll get the same p-value. And the between-groups F statistic will be the square of the t statistic you got in your t-test.
(Really, try it…. pretty cool, huh?)
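If you want to try it in Python, here's a sketch with made-up data using scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
a = rng.normal(0.0, 1.0, size=30)
b = rng.normal(0.5, 1.0, size=40)

t, p_t = stats.ttest_ind(a, b)   # pooled-variance t-test (equal_var=True)
F, p_f = stats.f_oneway(a, b)    # one-way ANOVA on the same two groups

# F equals t**2 and the p-values match, up to floating-point error
```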
This means they work the same way. Unbalanced t-tests have the same practical issues with unequal samples, but it doesn’t otherwise affect the validity or bias in the test.
Problems in Factorial ANOVA
Factorial ANOVA includes all those ANOVA models with more than one crossed factor. It generally involves one or more interaction terms.
Real issues with unequal sample sizes do occur in factorial ANOVA in one situation: when the sample sizes are confounded in the two (or more) factors. Let’s unpack this.
For example, in a two-way ANOVA, let’s say that your two independent variables (factors) are Age (young vs. old) and Marital Status (married vs. not).
Let’s say there are twice as many young people as old. So unequal sample sizes.
And say the younger group has a much larger percentage of singles than the older group. In other words, the two factors are not independent of each other. The effect of marital status cannot be distinguished from the effect of age.
So you may get a big mean difference between the marital statuses, but it’s really being driven by age.
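Here's a deterministic sketch of that confounding (all numbers invented, and the outcome depends only on age, not marital status):

```python
import numpy as np

# Outcome depends ONLY on age: young -> 10, old -> 20. Zero marital effect.
young_single  = np.full(80, 10.0)
young_married = np.full(20, 10.0)
old_single    = np.full(10, 20.0)
old_married   = np.full(40, 20.0)

single  = np.concatenate([young_single, old_single])
married = np.concatenate([young_married, old_married])

# The apparent marital-status "effect" is an age effect in disguise
gap = married.mean() - single.mean()
```

The married group comes out about 5.6 points higher even though marital status has zero effect, simply because the married group is mostly old and old people score higher.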
What about Chi Square Tests?
(This article is about ANOVA and t-tests, but I’ve updated it to include chi-square tests after getting a lot of questions.)
There are a number of different chi-square tests, but the two that can seem concerning in this context are the Chi-Square Test of Independence and the Chi-Square Test of Homogeneity. Both have two categorical variables. Both count the frequencies of the combinations of these categories.
They calculate the test statistic the same way. Without getting into the math, it’s basically a comparison of the actual frequencies of the combinations with the frequencies you’d expect under the null hypothesis.
And luckily, unequal sample sizes do not affect the ability to calculate that chi-square test statistic. It’s pretty rare to have equal sample sizes, in fact. The expected values take the sample sizes into account. So no problems at all here.
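For example, with scipy (made-up counts, where group B is ten times the size of group A but the proportions are identical):

```python
import numpy as np
from scipy.stats import chi2_contingency

#                 yes   no
table = np.array([[  9,  21],    # group A: n = 30,  30% yes
                  [ 90, 210]])   # group B: n = 300, 30% yes

chi2, p, dof, expected = chi2_contingency(table)
# Expected counts scale with each row's sample size, so identical
# proportions give chi2 = 0 and p = 1 despite the 10-to-1 imbalance
```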
That said, when there is a third variable involved, you can have an issue with Simpson’s Paradox. You may or may not have measured that third variable, so it’s worth asking whether some other variable could be creating an association in the combined data that doesn’t exist within any single level of that variable.
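A minimal numeric illustration of Simpson’s Paradox (invented counts): within each level of a third variable, the two groups have identical success rates, yet the pooled rates look wildly different.

```python
# successes / totals, split by a third variable with levels X and Y
a_x, n_ax = 90, 100   # group A, level X: 90% success
b_x, n_bx =  9,  10   # group B, level X: 90% success
a_y, n_ay =  1,  10   # group A, level Y: 10% success
b_y, n_by = 10, 100   # group B, level Y: 10% success

# Pool across the third variable and the equality disappears
pooled_a = (a_x + a_y) / (n_ax + n_ay)   # 91/110, about 0.83
pooled_b = (b_x + b_y) / (n_bx + n_by)   # 19/110, about 0.17
```

The pooled comparison suggests a huge group difference that simply isn’t there within either level of the third variable — the groups just sampled the two levels in very different proportions.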
But that’s not really an issue with unequal sample sizes. That’s an issue of omitting an important variable from an analysis.