You may have never heard of listwise deletion for missing data, but you’ve probably used it.
Listwise deletion means that any individual in a data set is deleted from an analysis if they’re missing data on any variable in the analysis.
It’s the default in most software packages.
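To make the definition concrete, here is a minimal sketch of listwise deletion in plain Python. The toy data, the variable names, and the `listwise_delete` helper are all made up for illustration; your software does the equivalent behind the scenes when it drops incomplete cases.

```python
# Hypothetical toy data set; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "score": 7.1},
    {"age": 29, "income": None, "score": 6.4},
    {"age": None, "income": 61000, "score": 8.0},
    {"age": 45, "income": 48000, "score": 5.9},
]


def listwise_delete(rows, variables):
    """Keep only the cases observed on every variable in the analysis."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


# An analysis using all three variables keeps only the fully observed cases.
complete = listwise_delete(rows, ["age", "income", "score"])
print(len(complete))  # 2

# An analysis using only "score" keeps everyone, since no one is missing it.
print(len(listwise_delete(rows, ["score"])))  # 4
```

Notice that which cases get dropped depends on which variables are in the analysis, not on the data set as a whole.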
Although its simplicity is a major advantage, listwise deletion causes big problems in many missing data situations.
But not always. If you happen to have one of the uncommon missing data situations in which listwise deletion doesn’t cause problems, it’s a reasonable solution.
You hear a lot about its problems because most data sets don’t fit two conditions that must hold for listwise deletion to work well.
So let’s talk about those two conditions and what the problems are when they’re not met.
When Listwise Deletion Works
The Data are Missing Completely at Random
When the incomplete cases that are dropped differ from the complete cases still in the sample, then the carefully selected random sample is no longer reflective of the entire population.
You’ve now got a biased sample and biased results. That’s not good.
You can’t trust those results to be reflective of the population.
But sometimes the cases with missing data are no different from the complete cases: they are a purely random subset of the data. This is called Missing Completely at Random (MCAR).
If this holds, listwise deletion won’t introduce any bias into analyses based on complete cases. The only cost is the smaller sample.
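A small simulation can illustrate the difference. This is only a sketch under made-up assumptions: a hypothetical normally distributed score, a 30% MCAR dropout rate, and a contrasting scenario where high scorers go missing more often. None of these numbers come from real data.

```python
import random
from statistics import mean

random.seed(42)  # arbitrary seed so the sketch is repeatable

# Hypothetical full sample: 10,000 scores centered near 50 (assumed values).
full = [random.gauss(50, 10) for _ in range(10_000)]

# MCAR: every case has the same 30% chance of being missing.
mcar_kept = [x for x in full if random.random() > 0.30]

# Not MCAR: cases scoring above 55 go missing 60% of the time.
not_mcar_kept = [x for x in full if not (x > 55 and random.random() < 0.60)]

full_mean = mean(full)
mcar_mean = mean(mcar_kept)
not_mcar_mean = mean(not_mcar_kept)

print(f"full sample mean:          {full_mean:.2f}")
print(f"complete cases, MCAR:      {mcar_mean:.2f}")
print(f"complete cases, not MCAR:  {not_mcar_mean:.2f}")
```

Under MCAR the complete-case mean should land very close to the full-sample mean. When missingness depends on the score itself, the complete-case mean is pulled noticeably downward.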
You have sufficient power anyway, even though you lost part of your data set
Dropping more than a few cases from a data set can have dramatic consequences for sample size. Since statistical power is directly tied to sample size, losing one results in losing the other.
But listwise deletion doesn’t always drop so many cases that power suffers. If the percentage of missing data is very small, or if your sample was larger than you needed to begin with, you may still have adequate power to detect meaningful effects.
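To get a feel for how quickly power erodes as cases are dropped, here is a rough sketch using a normal approximation. It assumes a two-group comparison of means with a two-sided z-test, a standardized effect size of 0.5, and alpha of 0.05. The function name `approx_power` and all of those settings are illustrative assumptions, not a substitute for a proper power analysis of your own model.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group z-test.

    effect_size is the standardized mean difference. The negligible
    chance of rejecting in the wrong direction is ignored.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(effect_size * sqrt(n_per_group / 2) - z_crit)


# Same assumed effect size, shrinking complete-case sample per group.
for n in (100, 80, 50, 30):
    print(n, round(approx_power(0.5, n), 2))
```

If the power at your complete-case sample size is still comfortably high, the lost cases are costing you little.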
There is one caveat here though. It’s possible to have only a small percentage of observations missing overall, yet still lose a large part of the sample to listwise deletion. This is the situation that’s most problematic for listwise deletion.
This happens when an analysis includes many variables, and each is missing for a few unique cases. Say you have a data set with 200 observations and use 10 variables in a regression model. If each variable is missing on the same 10 cases, you end up with 190 complete cases. You’ve lost only 5% of the sample. Not bad.
But if you have a different 10 cases missing on each variable, you will lose 100 cases (10 cases by 10 variables). With only 5% of the data values missing (100 of 2,000), you end up with 100 complete cases. You’ve lost 50% of the sample. Not so good.
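The arithmetic in this example can be checked with a few lines of Python. The sketch simply assigns case numbers to the missing values in each scenario; the `summarize` helper and the particular case numbers are made up for illustration.

```python
n_cases, n_vars = 200, 10

# Scenario A: the same 10 cases are missing on every variable.
same_cases = {v: set(range(10)) for v in range(n_vars)}

# Scenario B: a different 10 cases are missing on each variable.
different_cases = {v: set(range(v * 10, v * 10 + 10)) for v in range(n_vars)}


def summarize(missing_by_var):
    """Count complete cases and compare value-level vs. case-level loss."""
    dropped = set().union(*missing_by_var.values())
    values_missing = sum(len(cases) for cases in missing_by_var.values())
    return {
        "complete_cases": n_cases - len(dropped),
        "pct_values_missing": 100 * values_missing / (n_cases * n_vars),
        "pct_cases_lost": 100 * len(dropped) / n_cases,
    }


print(summarize(same_cases))
# {'complete_cases': 190, 'pct_values_missing': 5.0, 'pct_cases_lost': 5.0}
print(summarize(different_cases))
# {'complete_cases': 100, 'pct_values_missing': 5.0, 'pct_cases_lost': 50.0}
```

Both scenarios have the same 5% of values missing. What changes is how those missing values are spread across cases.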
How to Tell if Listwise Deletion is Reasonable
Before you just assume that listwise deletion is an adequate approach, it is important to establish that these two conditions are met.
Spend some time doing missing data diagnosis to understand the patterns and randomness of missingness. As with testing assumptions in linear models, there isn’t one definitive test to tell you if the assumptions are met for listwise deletion. It’s more an exercise in gathering evidence that the assumptions aren’t clearly violated.
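As one possible starting point for that diagnosis, the sketch below counts missing values per variable and compares an observed variable between cases that are and aren’t missing on another variable. The toy data and the helper names `missing_counts` and `compare_by_missingness` are hypothetical. A large gap between the two groups is evidence against MCAR, while a small gap doesn’t prove MCAR holds.

```python
from statistics import mean

# Hypothetical toy data set; None marks a missing value.
rows = [
    {"age": 25, "income": 40000},
    {"age": 31, "income": None},
    {"age": 47, "income": 72000},
    {"age": 52, "income": None},
    {"age": 38, "income": 55000},
    {"age": 60, "income": None},
]


def missing_counts(rows):
    """Number of missing values on each variable."""
    return {v: sum(r[v] is None for r in rows) for v in rows[0]}


def compare_by_missingness(rows, observed_var, missing_var):
    """Mean of observed_var for cases missing vs. not missing missing_var."""
    usable = [r for r in rows if r[observed_var] is not None]
    missing = [r[observed_var] for r in usable if r[missing_var] is None]
    present = [r[observed_var] for r in usable if r[missing_var] is not None]
    return mean(missing), mean(present)


print(missing_counts(rows))  # {'age': 0, 'income': 3}

age_if_missing, age_if_present = compare_by_missingness(rows, "age", "income")
print(f"mean age, income missing:  {age_if_missing:.1f}")
print(f"mean age, income observed: {age_if_present:.1f}")
```

In this made-up data the cases missing income are older on average, which would be a warning sign that the complete cases aren’t a random subset.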