Two methods for dealing with missing data, vast improvements over traditional approaches, have become available in mainstream statistical software in the last few years.
Both of the methods discussed here require that the data are missing at random: whether a value is missing may depend on the observed data, but not on the missing values themselves. If this assumption holds, the resulting estimates (i.e., regression coefficients and standard errors) will be unbiased, without the loss of power that comes from discarding incomplete cases.
The first method is Multiple Imputation (MI). Just like the old-fashioned imputation methods, Multiple Imputation fills in estimates for the missing data. But to capture the uncertainty in those estimates, MI estimates the values multiple times. Because it uses an imputation method with error built in, the multiple estimates should be similar, but not identical.
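To make the idea of "imputation with error built in" concrete, here is a minimal sketch in Python using only the standard library. It assumes a single predictor x that is fully observed and an outcome y with some missing values (marked None); the function names and the toy data are hypothetical. It is a simplification: proper Multiple Imputation also draws the regression parameters themselves from their posterior distribution, which this sketch does not do.

```python
import random
import statistics

def impute_once(x, y, rng):
    """Return a copy of y with each None replaced by a regression
    prediction plus random noise (stochastic regression imputation)."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    xs = [xi for xi, _ in pairs]
    ys = [yi for _, yi in pairs]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in pairs) / sxx
    intercept = mean_y - slope * mean_x
    # Residual standard deviation from the complete cases.
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in pairs)
    resid_sd = (rss / (len(pairs) - 2)) ** 0.5
    return [
        yi if yi is not None
        else intercept + slope * xi + rng.gauss(0, resid_sd)
        for xi, yi in zip(x, y)
    ]

def multiply_impute(x, y, m=5, seed=0):
    """Create m completed copies of y, each with its own random draws."""
    rng = random.Random(seed)
    return [impute_once(x, y, rng) for _ in range(m)]

# Hypothetical toy data: y is missing for the 3rd and 6th cases.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, None, 8.2, 9.8, None, 14.1, 16.3]
datasets = multiply_impute(x, y, m=5)
```

Each list in datasets keeps the observed values unchanged and differs only in the two imputed positions, which is the behavior described above.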
The result is multiple data sets with identical values for all of the non-missing values and slightly different values for the imputed values in each data set. The statistical analysis of interest, such as ANOVA or logistic regression, is performed separately on each data set, and the results are then combined. Because of the variation in the imputed values, there should also be variation in the parameter estimates, leading to appropriate estimates of standard errors and appropriate p-values.
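The combining step is usually done with what are known as Rubin's rules. The sketch below, in standard-library Python, shows the arithmetic for a single coefficient. The function name and the input numbers are made up for illustration; in practice the estimates and squared standard errors would come from running the same analysis on each imputed data set.

```python
import statistics

def pool_rubin(estimates, variances):
    """Combine one parameter's results across m imputed data sets.

    estimates: the m point estimates (e.g., a regression coefficient)
    variances: the m squared standard errors of those estimates
    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)       # average sampling variance
    between = statistics.variance(estimates)   # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total ** 0.5

# Hypothetical results from m = 3 imputed data sets.
coefs = [1.0, 1.2, 0.8]
squared_ses = [0.04, 0.04, 0.04]
pooled_coef, pooled_se = pool_rubin(coefs, squared_ses)
```

The between-imputation term is what carries the uncertainty about the missing values into the final standard error: if the imputed data sets all gave the same coefficient, the pooled standard error would reduce to the ordinary one.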
Multiple Imputation is available in SAS, S-Plus, R, and now SPSS 17.0 (but you need the Missing Values Analysis add-on module).
The second method is to analyze the full, incomplete data set using maximum likelihood estimation. This method does not impute any data, but rather uses each case's available data to compute maximum likelihood estimates. The maximum likelihood estimate of a parameter is the value of the parameter under which the observed data are most likely.
When data are missing, we can factor the likelihood function. The likelihood is computed separately for those cases with complete data on some variables and those with complete data on all variables. These two likelihoods are then maximized together to find the estimates. Like multiple imputation, this method gives unbiased parameter estimates and standard errors. One advantage is that it does not require the careful selection of variables used to impute values that Multiple Imputation requires. It is, however, limited to linear models.
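As an illustration of the factored likelihood, consider the simplest case: two variables that are assumed to be bivariate normal, where x is always observed and y is missing for some cases (marked None). Cases with both values contribute a bivariate normal term to the log-likelihood, and cases with only x contribute a univariate normal term. For this particular missing-data pattern the maximum has a known closed form, which the sketch below uses. The function names and data are hypothetical, and general-purpose software maximizes the likelihood numerically rather than with this shortcut.

```python
import math

def ml_bivariate_normal(x, y):
    """Closed-form ML estimates when x is complete and y is partly missing.
    Returns (mu_x, mu_y, var_x, var_y, cov_xy)."""
    n = len(x)
    # x's mean and variance use every case.
    mu_x = sum(x) / n
    var_x = sum((xi - mu_x) ** 2 for xi in x) / n
    # The regression of y on x uses only the complete cases.
    complete = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n_c = len(complete)
    xbar_c = sum(xi for xi, _ in complete) / n_c
    ybar_c = sum(yi for _, yi in complete) / n_c
    sxx = sum((xi - xbar_c) ** 2 for xi, _ in complete)
    sxy = sum((xi - xbar_c) * (yi - ybar_c) for xi, yi in complete)
    slope = sxy / sxx
    intercept = ybar_c - slope * xbar_c
    resid_var = sum(
        (yi - intercept - slope * xi) ** 2 for xi, yi in complete
    ) / n_c
    # Combine the two pieces to get y's mean, variance, and covariance.
    mu_y = intercept + slope * mu_x
    var_y = resid_var + slope ** 2 * var_x
    cov_xy = slope * var_x
    return mu_x, mu_y, var_x, var_y, cov_xy

def observed_loglik(params, x, y):
    """Log-likelihood of the observed data: a bivariate term for complete
    cases plus a univariate term for cases where only x is observed."""
    mu_x, mu_y, var_x, var_y, cov_xy = params
    det = var_x * var_y - cov_xy ** 2
    total = 0.0
    for xi, yi in zip(x, y):
        dx = xi - mu_x
        if yi is None:
            total += -0.5 * math.log(2 * math.pi * var_x) - 0.5 * dx ** 2 / var_x
        else:
            dy = yi - mu_y
            quad = (var_y * dx ** 2 - 2 * cov_xy * dx * dy + var_x * dy ** 2) / det
            total += -math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * quad
    return total

# Hypothetical toy data: y is missing for the 4th and 5th cases.
x = [1, 2, 3, 4, 5, 6]
y = [1.2, 1.9, 3.2, None, None, 6.1]
estimates = ml_bivariate_normal(x, y)
loglik_at_estimates = observed_loglik(estimates, x, y)
```

Note that the estimated mean of y is not simply the mean of the observed y values: the cases with only x observed still inform it through the relationship between x and y.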
Analysis of the full, incomplete data set using maximum likelihood estimation is available in AMOS. AMOS is a structural equation modeling package, but it can run multiple linear regression models. AMOS is easy to use and is now integrated into SPSS, but it will not produce residual plots, influence statistics, and other typical output from regression packages.
Schafer, J. Software for Multiple Imputation
Hox, J.J. (1999) A Review of Current Software for Handling Missing Data, Kwantitatieve Methoden, 62, 123-138.
Allison, P. (2000). Multiple Imputation for Missing Data: A Cautionary Tale, Sociological Methods and Research, 28, 301-309.