In statistical practice, there are many situations where best practices are clear. There are many, though, where they aren’t. The granddaddy of these practices is adjusting p-values when you make multiple comparisons. There are good reasons to do it and good reasons not to. It depends on the situation.
At the heart of the issue is a concept called Family-wise Error Rate (FWER). FWER is the probability of getting at least one Type I error in a set (or family) of tests (Tukey, 1953). Recall that a Type I error occurs when you reject your null hypothesis even though it is true.
Remember, when you choose alpha = .05, you are accepting a 1 in 20 likelihood the test will have a p-value lower than .05, even if the null hypothesis is true.
This is all well and good. We can live with this. The trouble comes when you run another statistical test. And another. And another. It turns out, each time you run a test, the chance of getting at least one Type I error goes up.
Why do multiple comparisons increase family-wise error?
Here is an analogy. Let’s say you are playing a board game where you roll one die to move, and you need a certain specific number to move on the board. Maybe it is the number 6.
What is the probability that you will roll the number 6 at least once? Start with a single roll. Mathematically, the probability of an outcome is the number of favorable outcomes divided by the number of possible outcomes, so the probability of rolling a 6 on one roll of a 6-sided die is:
1/6 ≈ .167
Another way of putting this is that the probability of NOT rolling a 6 on one roll is 5/6 ≈ .833.
But what if the game gives you 4 chances? Now how likely are you to roll at least one 6? Intuitively, you probably understand that the chance of rolling at least one 6 increases with each roll of the die. This is true even though the probability of rolling a 6 on each roll stays the same. Most of us have some practical experience with this after all. The math would look like this:
For 4 rolls, P(no 6) = (5/6)^4 = 0.482253. (The 5/6 is raised to the 4th power because the rolls are independent; the probability of avoiding a 6 on every roll shrinks with each additional roll.)
And P(at least one 6) = 1 – P(no 6)
Substituting the values, we get:
P(at least one 6) = 1 – 0.482253 = 0.517747
So, essentially, you have about a 52% chance of rolling at least one 6 in 4 rolls.
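The dice arithmetic above is easy to check in a few lines. This is just a sketch of the calculation, nothing more:

```python
# Probability of rolling at least one 6 in 4 rolls of a fair die
p_no_six_per_roll = 5 / 6                 # chance of NOT rolling a 6 on one roll
p_no_six_in_4 = p_no_six_per_roll ** 4    # independent rolls multiply
p_at_least_one_six = 1 - p_no_six_in_4

print(round(p_no_six_in_4, 6))       # 0.482253
print(round(p_at_least_one_six, 6))  # 0.517747
```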
See where I am going with this?
In data analysis, it’s the same thing. The more tests you run, the higher the probability of at least one Type I error. Even if alpha stays the same on each test.
Let’s say you are a psychologist using a national data set on adolescents. You have several hypotheses you want to test about teen problems and parental engagement. In the end, you decide to run a series of analyses to test the following hypotheses:
- Teens with a curfew will be less likely to use heavy drugs (LSD, mushrooms, crack, heroin) than those without a curfew.
- Teens with a curfew will report greater happiness than those without a curfew.
- Teens having dinner with at least one parent 3 or more days a week will be less likely to use heavy drugs (LSD, mushrooms, crack, heroin) than those having fewer than 3 dinners per week with a parent.
- Teens having dinner with at least one parent 3 or more days a week will report greater happiness than those having fewer than 3 dinners per week with a parent.
These are perfectly sound hypotheses, and it’s worth running these tests. Just keep in mind that by the time you test hypothesis #4, the chance of making a Type I error at least once across your tests isn’t .05. It’s 1 – .95^4 = .185.
And with 20 different tests? 1 – .95^20 = .642.
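These numbers follow the same pattern as the dice example: with m independent tests each run at level alpha, FWER = 1 – (1 – alpha)^m. A minimal sketch (the function name is my own):

```python
def fwer(m, alpha=0.05):
    """Family-wise error rate for m independent tests, each at level alpha."""
    return 1 - (1 - alpha) ** m

# FWER climbs quickly as the number of tests grows
for m in (1, 4, 20):
    print(m, round(fwer(m), 3))
```

Running this prints .05 for one test, .185 for four, and .642 for twenty, matching the figures above.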
Some people never adjust, some people adjust as a rule. In the example above, the researcher might decide to report any finding that is significant at p < .05 but highlight the need for replication in their conclusions. On the other hand, if the researcher chooses to run even more hypothesis tests, they might decide to adjust for FWER to reduce the very high probability of false positives.
Deciding if it’s Helpful to Consider Family-wise Error Rate
There are ways to hold the family-wise error rate at .05, but that control comes at the expense of increased Type II error and lower power.
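Two standard corrections illustrate the tradeoff: the Bonferroni adjustment simply multiplies each p-value by the number of tests, while the Holm step-down method controls FWER at the same level but is somewhat less conservative. The sketch below is my own illustration; the function names and example p-values are hypothetical, not from any particular study:

```python
def bonferroni(pvals):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(pvals)
    return [min(p * m, 1.0) for p in pvals]

def holm(pvals):
    """Holm step-down adjustment: still controls FWER, rejects more often."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Smallest p-value is multiplied by m, the next by m - 1, and so on
        adj = min((m - rank) * pvals[i], 1.0)
        # Enforce monotonicity so adjusted p-values keep their ordering
        running_max = max(running_max, adj)
        adjusted[i] = running_max
    return adjusted

# Hypothetical p-values for the four curfew/dinner hypotheses above
pvals = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(pvals))  # [0.04, 0.16, 0.12, 0.02]
print(holm(pvals))
```

Notice that Holm leaves the second and third p-values smaller than Bonferroni does, which is why it loses less power while still capping FWER.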
Here are a few questions to ask yourself as you’re making decisions:
- Is it really problematic if you make even one Type I error?
- What are the costs and benefits of Type I and Type II errors in this research situation?
- Are you the right person to evaluate this tradeoff between errors or should you just summarize the data (providing estimates and confidence intervals) so each reader can come to their own conclusions?
Regardless of what you decide, it’s important to understand the issues and to take some time to stop and think about what is best for this particular research project. Professors, reviewers, and colleagues may be adamant there’s only one right way to do things, but it’s not that simple.
The best way to do things depends on how people are going to use and interpret your study results, and this will vary from one study to the next. As much as we want statistical tests to give us a straight answer, there is always uncertainty. After all, this is one of the reasons we look for replication of results across studies.