Missing Data, and multiple imputation specifically, is one area of statistics that is changing rapidly. Research is still ongoing, and each year new findings on best practices and new techniques in software appear.
The downside for researchers is that some of the recommendations missing data statisticians were making even five years ago have changed.
Remember that there are three goals of multiple imputation, or any missing data technique: Unbiased parameter estimates in the final analysis (regression coefficients, group means, odds ratios, etc.); accurate standard errors of those parameter estimates, and therefore, accurate p-values in the analysis; and adequate power to find meaningful parameter values significant.
So here are a few updates that will help you achieve these goals.
1. Don’t round off imputations for dummy variables. Many common imputation techniques, like MCMC, require normally distributed variables. Suggestions for imputing categorical variables were to dummy code them, impute them, then round off imputed values to 0 or 1. Recent research, however, has found that rounding off imputed values actually leads to biased parameter estimates in the analysis model. You actually get better results by leaving the imputed values at impossible values, even though it’s counter-intuitive.
2. Don’t transform skewed variables. Likewise, when you transform a variable to meet normality assumptions before imputing, you not only are changing the distribution of that variable but the relationship between that variable and the others you use to impute. Doing so can lead to imputing outliers, creating more bias than just imputing the skewed variable.
3. Use more imputations. The advice for years has been that 5-10 imputations are adequate. And while this is true for unbiasedness, you can get inconsistent results if you run the multiple imputation more than once. Bodner (2008) recommends having as many imputations as the percentage of missing data. Since running more imputations isn’t any more work for the data analyst, there’s no reason not to.
4. Create multiplicative terms before imputing. When the analysis model contains a multiplicative term, like an interaction term or a quadratic, create the multiplicative terms first, then impute. Imputing first, and then creating the multiplicative terms actually biases the regression parameters of the multiplicative term (von Hippel, 2009).
5. Alternatives to multiple imputation aren’t usually better. Multiple imputation assumes the data are missing at random. In most tests, if an assumption is not met, there are better alternatives—a nonparametric test or an alternative type of model. This is often not true with missing data. Alternatives like listwise deletion (a.k.a. ignoring it) have more stringent assumptions. So do nonignorable missing data techniques like Heckman’s selection models.
Allison, Paul D. 2005. “Imputation of Categorical Variables with PROC MI,” Presented at
the 30th Meeting of SAS Users Group International, April 10–13, Philadephia, PA.
Bodner, T. E. 2008. “What Improves with Increased Missing Data Imputations?”
Structural Equation Modeling 15(4):651–75.
Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual Review of Psychology, 60, 549-576.
von Hippel, P.T. (2009). “How To Impute Squares, Interactions, and Other Transformed Variables.” Sociological Methodology 39.