Missing Data

3 Ad-hoc Missing Data Approaches that You Should Never Use

June 15th, 2009 by Karen Grace-Martin

The default approach to dealing with missing data in most statistical software packages is listwise deletion–dropping any case with data missing on any variable involved anywhere in the analysis. It also goes under the names case deletion and complete case analysis.
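To make the definition concrete, here is a minimal sketch of listwise deletion in plain Python. The toy data, the variable names, and the `listwise_delete` helper are all made up for illustration; statistical packages do this internally and may differ in detail.

```python
# Hypothetical toy data: each dict is one case, None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "score": 7.1},
    {"age": 29, "income": None, "score": 6.4},
    {"age": None, "income": 61000, "score": 8.0},
    {"age": 45, "income": 58000, "score": 7.7},
]

def listwise_delete(rows, variables):
    """Keep only the cases observed on every variable in the analysis."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

complete = listwise_delete(rows, ["age", "income", "score"])
print(len(rows), "cases before,", len(complete), "cases after")
```

Note that a case is dropped even when only one of its values is missing, which is where the lost sample size comes from.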

Although this approach can be really painful (you worked hard to collect those data, only to drop them!), it does work well in some situations. By "works well," I mean it meets 3 criteria:

– gives unbiased parameter estimates

– gives accurate (or at least conservative) standard error estimates

– results in adequate power.

But not always. So over the years, a number of ad hoc approaches have been proposed to stop the bloodletting of so much data. Although each solved some problems of listwise deletion, they created others. All three have been discredited in recent years and should NOT be used. They are:

Pairwise Deletion: use the available data for each part of an analysis. This has been shown to result in correlations beyond the -1 to 1 range and other fun statistical impossibilities.

Mean Imputation: substitute the mean of the observed values for all missing data. There are so many problems, it’s difficult to list them all, but suffice it to say, this technique never meets the above 3 criteria.

Dummy Variable: create a dummy variable that indicates whether a data point is missing, then substitute any arbitrary value for the missing data in the original variable. Use both variables in the analysis. While it does reduce the loss of power, it usually leads to biased results.
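Two of these problems are easy to see on toy numbers. The sketch below uses made-up data and hypothetical helper names (`pearson`, `pairwise_corr`). It first builds a set of pairwise correlations that cannot all hold at once, then shows mean imputation shrinking the variance of a variable.

```python
import math
import statistics

def pearson(pairs):
    """Pearson correlation over a list of fully observed (a, b) pairs."""
    xs, ys = zip(*pairs)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def pairwise_corr(rows, a, b):
    """Pairwise deletion: use every case where both a and b are observed."""
    pairs = [(r[a], r[b]) for r in rows
             if r[a] is not None and r[b] is not None]
    return pearson(pairs)

# Hypothetical data: each pair of variables is observed on different cases.
rows = (
    [{"x": v, "y": v, "z": None} for v in (1, 2, 3)]        # x and y rise together
    + [{"x": None, "y": v, "z": v} for v in (1, 2, 3)]      # y and z rise together
    + [{"x": v, "y": None, "z": 4 - v} for v in (1, 2, 3)]  # x and z move oppositely
)

r_xy = pairwise_corr(rows, "x", "y")
r_yz = pairwise_corr(rows, "y", "z")
r_xz = pairwise_corr(rows, "x", "z")

# Determinant of the 3x3 correlation matrix. For any set of correlations
# computed from one complete data set, it cannot be negative.
det = 1 + 2 * r_xy * r_yz * r_xz - r_xy**2 - r_yz**2 - r_xz**2
print("pairwise correlations:", r_xy, r_yz, r_xz, "determinant:", det)

# Mean imputation: fill 4 hypothetical missing cases with the observed mean.
observed = [2.0, 4.0, 6.0, 8.0]
imputed = observed + [statistics.mean(observed)] * 4
var_obs = statistics.variance(observed)
var_imp = statistics.variance(imputed)
print("variance before:", var_obs, "after mean imputation:", var_imp)
```

The determinant comes out negative, so no single complete data set could produce these three correlations together. In the second part the mean is unchanged, but the imputed values add cases without adding any spread, so the variance (and with it the standard errors) is understated.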

There are a number of good techniques for dealing with missing data, some of which are not hard to use, and which are now available in all major stat software. There is no reason to continue to use ad hoc techniques that create more problems than they solve.


Diagnosing Missing Data: A new way to graph missingness

June 4th, 2009 by Karen Grace-Martin

Some approaches to missing data work well in some situations, but perform very poorly in others. So it’s really important to get a good idea of the type and pattern of missingness in your data. You may even take different missing data approaches to different variables.

Matt Blackwell of the Harvard Social Science Statistics blog has come up with a nice way to visualize the missingness patterns in a data set. (I’m a big fan of graphing data to understand it). He calls it a Missingness Map.

The only drawback seems to be that it will be cumbersome for large data sets.
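For a rough feel of what such a display conveys, here is a text-only sketch in plain Python. It is not Blackwell's implementation, and the data and the function names `missingness_grid` and `missing_counts` are invented for illustration. Each case becomes a row and each variable a column, with `X` marking a missing value.

```python
# Hypothetical toy data: None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "score": 7.1},
    {"age": 29, "income": None, "score": 6.4},
    {"age": None, "income": None, "score": 8.0},
    {"age": 45, "income": 58000, "score": None},
]
variables = ["age", "income", "score"]

def missingness_grid(rows, variables):
    """One string per case: 'X' where the variable is missing, '.' otherwise."""
    return ["".join("X" if r[v] is None else "." for v in variables)
            for r in rows]

def missing_counts(rows, variables):
    """Number of missing values per variable."""
    return {v: sum(r[v] is None for r in rows) for v in variables}

grid = missingness_grid(rows, variables)
counts = missing_counts(rows, variables)

for i, line in enumerate(grid, start=1):
    print(f"case {i}: {line}")
print(counts)
```

Even this crude grid shows which variables tend to be missing together, and it also makes the drawback obvious: with thousands of cases, one row per case stops being readable.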


Multiple Imputation of Categorical Variables

June 1st, 2009 by Karen Grace-Martin

Most Multiple Imputation methods assume multivariate normality, so a common question is how to impute missing values from categorical variables.

Paul Allison, one of my favorite authors of statistical information for researchers, did a study that showed that the most common method actually gives worse results than listwise deletion. (Did I mention I’ve used it myself?) (more…)


Missing Data: Criteria for Choosing an Effective Approach

May 20th, 2009 by Karen Grace-Martin

In choosing an approach to missing data, there are a number of things to consider. But you need to keep in mind what you’re aiming for before you can even consider which approach to take.

There are three criteria we’re aiming for with any missing data technique:

1. Unbiased parameter estimates: Whether you’re estimating means, regressions, or odds ratios, you want your parameter estimates to be accurate representations of the actual population parameters. In statistical terms, that means the estimates should be unbiased. If all the (more…)


Five Advantages of Running Repeated Measures ANOVA as a Mixed Model

May 13th, 2009 by Karen Grace-Martin

There are two ways to run a repeated measures analysis. The traditional way is to treat it as a multivariate test, where each response is considered a separate variable. The other way is to run it as a mixed model. While the multivariate approach is easy to run and quite intuitive, there are a number of advantages to running a repeated measures analysis as a mixed model.

First I will explain the difference between the approaches, then briefly describe some of the advantages of using the mixed models approach. (more…)


EM Imputation and Missing Data: Is Mean Imputation Really so Terrible?

April 15th, 2009 by Karen Grace-Martin

I’m sure I don’t need to explain to you all the problems that occur as a result of missing data. Anyone who has dealt with missing data (that means everyone who has ever worked with real data) knows about the loss of power and sample size, and the potential bias in your results, that come with listwise deletion.

Listwise deletion is the default method for dealing with missing data in most statistical software packages. It simply means excluding from the analysis any cases with data missing on any variables involved in the analysis.

A very simple, and in many ways appealing, method devised to (more…)
