pairwise deletion

Member Training: Confusing Statistical Terms

February 28th, 2020 by

Learning statistics is difficult enough; throw in some especially confusing terminology and it can feel impossible! There are many ways that statistical language can be confusing.

Some terms mean one thing in the English language, but have another (usually more specific) meaning in statistics.

3 Ad-hoc Missing Data Approaches that You Should Never Use

June 15th, 2009 by

The default approach to dealing with missing data in most statistical software packages is listwise deletion: dropping any case with data missing on any variable involved anywhere in the analysis.  It also goes by the names case deletion and complete case analysis.
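To make the cost concrete, here is a minimal sketch of listwise deletion in Python with pandas. The data frame and its column names (x, y, z) are made up for illustration; the point is only that a case missing on any one variable is dropped from the whole analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: each variable is missing on a different case.
df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "y": [2.0, np.nan, 3.0, 5.0, 4.0, 7.0],
    "z": [1.0, 3.0, 2.0, np.nan, 5.0, 6.0],
})

# Listwise deletion: keep only the cases that are complete
# on every variable used in the analysis.
complete_cases = df.dropna(subset=["x", "y", "z"])

n_total = len(df)
n_complete = len(complete_cases)
print(f"{n_complete} of {n_total} cases retained")
```

Each variable here is missing on only one case out of six, yet half of the cases are lost because the missing values fall on different rows.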

Although listwise deletion can be really painful (you worked hard to collect those data, only to drop them!), it does work well in some situations.  By "works well," I mean it meets three criteria:

– gives unbiased parameter estimates

– gives accurate (or at least conservative) standard error estimates

– results in adequate power.

But listwise deletion does not always meet these criteria.  So over the years, a number of ad hoc approaches have been proposed to stop the bloodletting of so much data.  Although each solved some problems of listwise deletion, each created others.  All three have been discredited in recent years and should NOT be used.  They are:

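Pairwise Deletion: use the available data for each part of an analysis.  Because each correlation may then be computed on a different subset of cases, the resulting correlation matrix can be internally inconsistent (not positive semidefinite), which can yield derived quantities such as partial correlations outside the -1 to 1 range and other fun statistical impossibilities.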
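To see one such impossibility, here is a minimal sketch in Python. The data are deliberately contrived and the column names are made up: each pair of variables is observed together on a different set of three cases. pandas' DataFrame.corr() computes each correlation from the pairwise-complete cases by default, so it behaves like pairwise deletion here.

```python
import numpy as np
import pandas as pd

nan = np.nan

# Contrived toy data: each pair of variables is observed together
# on a different set of three cases.
df = pd.DataFrame({
    "x": [1, 2, 3, nan, nan, nan, 1, 2, 3],
    "y": [1, 2, 3, 1, 2, 3, nan, nan, nan],
    "z": [nan, nan, nan, 1, 2, 3, 3, 2, 1],
})

# Each entry is computed from the cases observed on that pair only.
pairwise_corr = df.corr()

# A correlation matrix from complete data never has a negative eigenvalue.
eigenvalues = np.linalg.eigvalsh(pairwise_corr.to_numpy())
min_eigenvalue = eigenvalues.min()

print(pairwise_corr.round(2))
print("smallest eigenvalue:", round(float(min_eigenvalue), 2))

# Listwise deletion, by contrast, would leave no cases at all here.
n_complete = len(df.dropna())
```

In this toy example x and y correlate perfectly, y and z correlate perfectly, and yet x and z correlate perfectly in the opposite direction. No single complete data set could produce all three at once, and the negative eigenvalue is the symptom of that.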

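Mean Imputation: substitute the mean of the observed values for all missing data.  There are so many problems that it’s difficult to list them all (for a start, it shrinks the variable's variance and weakens its correlations with other variables), but suffice it to say, this technique never meets all three of the above criteria.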
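Two of those problems are easy to demonstrate. The sketch below uses a small made-up data set (the values and variable names are assumptions for illustration only) and compares the variance of y, and its correlation with a fully observed x, before and after mean imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: x is fully observed, y is missing on two cases.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = pd.Series([2.0, 1.0, 4.0, 3.0, np.nan, np.nan, 8.0, 7.0])

# Mean imputation: replace every missing y with the mean of the observed y.
y_imputed = y.fillna(y.mean())

# pandas skips missing values here, so these use the observed cases only.
var_observed = y.var()
corr_observed = x.corr(y)

# The same statistics after imputation, using all eight cases.
var_imputed = y_imputed.var()
corr_imputed = x.corr(y_imputed)

print("variance:    observed", round(var_observed, 3), "imputed", round(var_imputed, 3))
print("correlation: observed", round(corr_observed, 3), "imputed", round(corr_imputed, 3))
```

The imputed values sit exactly at the mean, so they add cases without adding any spread. The variance drops and the correlation with x is pulled toward zero, even though no new information went in.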

Dummy Variable: create a dummy variable that indicates whether a data point is missing, then substitute any arbitrary value for the missing data in the original variable.  Use both variables in the analysis.  While it does reduce the loss of power, it usually leads to biased results.
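So that you can recognize this approach when you see it in someone's code, here is a minimal sketch of how the two variables are typically built. The column name income, the fill value of 0, and the data are all hypothetical; this shows the construction only and is not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical predictor with two missing values.
df = pd.DataFrame({"income": [40.0, np.nan, 55.0, np.nan, 62.0]})

# Any constant will do; the method treats this choice as arbitrary.
ARBITRARY_FILL = 0.0

# Step 1: a dummy variable flagging which cases were missing.
df["income_missing"] = df["income"].isna().astype(int)

# Step 2: the original variable with the arbitrary value filled in.
df["income_filled"] = df["income"].fillna(ARBITRARY_FILL)

# Both income_missing and income_filled would then be entered as predictors.
print(df)
```

Every case is kept, which is why the approach preserves power, but the filled-in values are not real data.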

There are a number of good techniques for dealing with missing data, some of which are not hard to use, and which are now available in all major stat software.  There is no reason to continue to use ad hoc techniques that create more problems than they solve.