When Listwise Deletion Works for Missing Data

You may have never heard of listwise deletion for missing data, but you’ve probably used it.

Listwise deletion means that any individual in a data set is deleted from an analysis if they’re missing data on any variable in the analysis.

It’s the default in most software packages.

Although its simplicity is a major advantage, it causes big problems in many missing data situations.

But not always.  If you happen to have one of the uncommon missing data situations in which listwise deletion doesn’t cause problems, it’s a reasonable solution.

You hear a lot about its problems because most data sets don’t fit two conditions that must hold for listwise deletion to work well.

So let’s talk about those two conditions and what the problems are when they’re not met.

When Listwise Deletion Works

  1. The Data are Missing Completely at Random

When the incomplete cases that are dropped differ from the complete cases still in the sample, then the carefully selected random sample is no longer reflective of the entire population.

You’ve now got a biased sample and biased results.  That’s not good.

You can’t trust those results to be reflective of the population.

But sometimes the cases with missing data are no different from the complete cases: they are a purely random subset of the data.  This is called Missing Completely at Random (MCAR).

If this holds, there won’t be any bias in analyses based on complete cases.
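To see the difference, here is a minimal simulation sketch using only Python's standard library.  The population (mean 50, standard deviation 10), the 30% drop rate, and the rule that high values go missing more often are all made-up assumptions for illustration, not values from any real study.

```python
import random
import statistics

def complete_case_mean(values, drop_prob):
    """Mean of the cases that survive deletion.

    drop_prob(v) returns the probability that value v goes missing.
    """
    kept = [v for v in values if random.random() >= drop_prob(v)]
    return statistics.mean(kept)

random.seed(42)

# Hypothetical population: 20,000 scores with mean 50 and SD 10.
population = [random.gauss(50, 10) for _ in range(20000)]
full_mean = statistics.mean(population)

# MCAR: every case has the same 30% chance of being missing.
mcar_mean = complete_case_mean(population, lambda v: 0.30)

# Not MCAR: cases with high values are much more likely to be missing.
biased_mean = complete_case_mean(
    population, lambda v: 0.50 if v > 50 else 0.10
)

print(f"Full data mean:           {full_mean:.2f}")
print(f"Complete cases, MCAR:     {mcar_mean:.2f}")
print(f"Complete cases, not MCAR: {biased_mean:.2f}")
```

Under these assumptions the MCAR mean should land very close to the full data mean, while the second complete-case mean is pulled noticeably downward because the high scorers were dropped more often.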

  2. You have sufficient power anyway, even though you lost part of your data set

Dropping more than a few cases from a data set can have dramatic consequences for sample size.  Since statistical power is directly tied to sample size, losing one results in losing the other.

But listwise deletion doesn’t always drop so many cases that power suffers.  If the percentage of missing data is very small or you had an overly large sample to begin with, you may still have adequate power to detect meaningful effects.
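As a rough illustration of how sample size drives power, the sketch below uses a normal approximation for a two-sided, two-group comparison with equal group sizes.  The function name `approx_power`, the standardized effect size of 0.4, and the group sizes are assumptions chosen for the example; a real analysis would use a power routine matched to the actual model.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test.

    effect_size is the standardized mean difference (Cohen's d).
    This is a normal approximation, not an exact t-test calculation.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability of rejecting in either tail when the effect is real.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

EFFECT_SIZE = 0.4  # hypothetical effect size for illustration

power_before = approx_power(EFFECT_SIZE, n_per_group=100)  # 200 cases total
power_after = approx_power(EFFECT_SIZE, n_per_group=50)    # 100 cases total

print(f"Power with 200 cases: {power_before:.2f}")
print(f"Power with 100 cases: {power_after:.2f}")
```

With these assumed numbers, power falls from about 0.81 to about 0.52 when half the cases are lost.  With a much larger starting sample or a bigger effect, the same loss could leave power comfortably high.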

There is one caveat here though.  It’s possible to have only a small percentage of observations missing overall, yet still lose a large part of the sample to listwise deletion.  This is the situation that’s most problematic for listwise deletion.

This happens when an analysis includes many variables, and each is missing for a few different cases.  Say you have a data set with 200 observations and use 10 variables in a regression model.  If each variable is missing on the same 10 cases, you end up with 190 complete cases, 5% missing.  Not bad.

But if you have a different 10 cases missing on each variable, you will lose 100 cases (10 cases by 10 variables).   Only 5% of the data values are missing (100 of 2,000), yet you end up with 100 complete cases, 50% of the sample gone.  Not so good.
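The arithmetic in this example can be checked with a short sketch.  The helper `count_complete` and the way missing cases are laid out (case numbers 0 to 9 for the first variable, 10 to 19 for the second, and so on) are just assumptions for illustration.

```python
N_CASES = 200
N_VARS = 10

def count_complete(missing_by_var, n_cases):
    """Number of cases with no missing value on any variable.

    missing_by_var is a list of sets; each set holds the case numbers
    that are missing on one variable.
    """
    incomplete = set().union(*missing_by_var)
    return n_cases - len(incomplete)

# Scenario A: the same 10 cases are missing on every variable.
same_cases = [set(range(10)) for _ in range(N_VARS)]

# Scenario B: a different 10 cases are missing on each variable.
different_cases = [set(range(v * 10, v * 10 + 10)) for v in range(N_VARS)]

for label, pattern in [("Same cases", same_cases),
                       ("Different cases", different_cases)]:
    missing_values = sum(len(s) for s in pattern)
    value_rate = missing_values / (N_CASES * N_VARS)
    complete = count_complete(pattern, N_CASES)
    print(f"{label}: {value_rate:.0%} of values missing, "
          f"{complete} complete cases")
```

Both scenarios have the same 5% of values missing, but the first keeps 190 complete cases and the second keeps only 100.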

How to Tell if Listwise Deletion is Reasonable

Before you just assume that listwise deletion is an adequate approach, it is important to establish that these two conditions are met.

Spend some time doing missing data diagnosis to understand patterns and randomness of missingness.  Like testing assumptions in linear models, there isn’t one definitive test to tell you if assumptions are met for listwise deletion.  It’s more an exercise in gathering evidence that assumptions aren’t clearly violated.
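Two simple pieces of evidence are how often each missingness pattern occurs and whether cases missing on one variable look different on another.  The sketch below shows both on a tiny made-up data set; the variable names `age` and `income`, the values, and the helper functions are all hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy data; None marks a missing value.
rows = [
    {"age": 25, "income": 31000},
    {"age": 31, "income": 42000},
    {"age": 38, "income": None},
    {"age": 44, "income": 58000},
    {"age": 52, "income": None},
    {"age": 59, "income": None},
    {"age": None, "income": 47000},
    {"age": 29, "income": 36000},
]

def missing_pattern(row):
    """Tuple of the variable names that are missing in this case."""
    return tuple(name for name, value in row.items() if value is None)

# How many cases show each pattern of missingness?
patterns = Counter(missing_pattern(r) for r in rows)

def mean_by_missingness(rows, observed, indicator):
    """Mean of `observed`, split by whether `indicator` is missing."""
    when_missing = [r[observed] for r in rows
                    if r[indicator] is None and r[observed] is not None]
    when_present = [r[observed] for r in rows
                    if r[indicator] is not None and r[observed] is not None]
    return mean(when_missing), mean(when_present)

age_if_income_missing, age_if_income_present = mean_by_missingness(
    rows, "age", "income"
)

print("Missingness patterns:", dict(patterns))
print(f"Mean age when income is missing: {age_if_income_missing:.1f}")
print(f"Mean age when income is present: {age_if_income_present:.1f}")
```

In this toy data the cases missing income are clearly older than the rest, which would be evidence against MCAR.  With real data you would look at many such comparisons rather than relying on any single one.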

And if one or the other of these conditions is clearly violated, there are now other good ways to deal with missing data, including maximum likelihood and multiple imputation approaches.

 



Comments

  1. Jonathan Bartlett says

    Hi Karen

    Just a small note to add regarding the missingness assumption required for listwise deletion (sometimes also called complete case analysis). As you wrote, if data are missing completely at random, it will be unbiased. When your analysis consists of fitting a regression model, it will also be unbiased provided missingness is independent of the outcome (dependent) variable, conditional on the covariates (independent variables). Under this condition the data need not be MCAR, and in certain setups can even be missing not at random, yet the listwise deletion analysis is still unbiased.

    I wrote a blog post on this last year, which may be useful:
    http://thestatsgeek.com/2013/07/06/when-is-complete-case-analysis-unbiased/

