In my last post, I gave a little quiz about missing data. This post has the answers.
If you want to try it yourself before you see the answers, go here. (It’s a short quiz, but if you’re like me, you find testing yourself irresistible.)
True or False?
1. Imputation is really just making up data to artificially inflate results. It’s better to just drop cases with missing data than to impute.
Imputation has gotten a bad rap because early imputation methods, like mean imputation, bias your results pretty badly. And single imputation underestimates standard errors.
But imputation has come a long way, baby!
Multiple imputation, when done well, gives pretty much the same unbiased results, with full power, as the full non-missing data set.
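To make the standard error point concrete, here is a minimal sketch of how multiple imputation combines results across imputed data sets using Rubin's rules. The function name `pool_rubin` and the five estimates and squared standard errors are hypothetical, made up purely for illustration. The idea is that the pooled variance adds the between-imputation variance to the average within-imputation variance, which is the piece a single imputation leaves out.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, squared_ses):
    """Pool one parameter across m imputed data sets with Rubin's rules.

    estimates   -- the parameter estimate from each imputed data set
    squared_ses -- the squared standard error from each imputed data set
    """
    m = len(estimates)
    pooled = mean(estimates)        # pooled point estimate
    within = mean(squared_ses)      # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (n - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Hypothetical results for one regression coefficient from m = 5 imputations.
estimates = [0.48, 0.52, 0.50, 0.55, 0.45]
squared_ses = [0.010, 0.011, 0.009, 0.010, 0.010]

pooled_est, pooled_se = pool_rubin(estimates, squared_ses)

# Ignoring the between-imputation variance is roughly what happens when
# one imputed data set is analyzed as if it were real data.
naive_se = sqrt(mean(squared_ses))

print(f"pooled estimate: {pooled_est:.3f}")
print(f"pooled SE: {pooled_se:.4f}  vs. naive SE: {naive_se:.4f}")
```

The pooled standard error comes out larger than the naive one because it accounts for the uncertainty about what the missing values actually were.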
2. I can just impute the mean for any missing data. It won’t affect results, and improves power.
As I just said, mean imputation is bad imputation. It does improve power, but your results will be so biased, the improved power won’t help much. Sure, your results might be significant, but they’re the wrong results!
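If you'd like to see the damage for yourself, here is a small simulation sketch. Everything in it is an assumption chosen for illustration: the sample size, the 40% missingness rate, the true correlation of about 0.6, and the fact that the data are missing completely at random. It replaces missing values of `x` with the observed mean and compares the variance and correlation to the full data.

```python
import random
from statistics import mean, pvariance


def pearson(xs, ys):
    """Pearson correlation computed from scratch."""
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# Simulated full data: x and y are correlated at roughly 0.6.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.6 * xi + rng.gauss(0, 0.8) for xi in x]

# Knock out about 40% of x completely at random.
missing = [rng.random() < 0.4 for _ in range(n)]

# Mean imputation: every missing x gets the mean of the observed x values.
observed_mean = mean(xi for xi, m in zip(x, missing) if not m)
x_imputed = [observed_mean if m else xi for xi, m in zip(x, missing)]

var_full, var_imp = pvariance(x), pvariance(x_imputed)
r_full, r_imp = pearson(x, y), pearson(x_imputed, y)

print(f"variance of x:  full {var_full:.3f}, mean-imputed {var_imp:.3f}")
print(f"correlation:    full {r_full:.3f}, mean-imputed {r_imp:.3f}")
```

Under these assumptions the imputed variable has a noticeably smaller variance, and its correlation with `y` is pulled toward zero, even though the sample size looks complete.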
3. Multiple Imputation is fine for the predictor variables in a statistical model, but not for the response variable.
It’s true that imputing the response doesn’t add any new information to your regression model. But if you have missing data in the predictors as well, simultaneously imputing both response and predictors improves those predictor imputations.
4. Multiple Imputation is always the best way to deal with missing data.
It often is, and it gives good results. But it’s not always easy to do well, and it is a large-sample technique.
If you’re running a linear or log-linear model (like a regression or linear mixed model), maximum likelihood techniques give the same great, unbiased, uninflated, full-power results that multiple imputation does.
But you don’t have to spend the time and resources imputing anything.
5. When imputing, it’s important that the imputations be plausible data points.
It’s counter-intuitive, but it’s not actually important that imputations be plausible data points. The important thing when imputing is that your parameter estimates–your means, regression coefficients, or whatever it is you’re using this data to estimate–be accurate. Not the imputed data itself.
There are a number of situations, like imputing categorical data, where you actually get better parameter estimates when the imputed data itself aren’t plausible values.
6. Missing data isn’t really a problem if I’m just doing simple statistics, like chi-squares and t-tests.
It’s not the analysis you’re doing, but the percent, pattern, and randomness of the missing data that determine how problematic missing data are.
Even simple statistics need to be accurate and unbiased. How important is it that your results are correct?
7. The worst thing that missing data does is lower sample size and reduce power.
The loss of power from listwise deletion–the default in most software–can be quite devastating.
But even worse are the other two effects of missing data: biased parameter estimates and biased standard errors. They, in essence, make your results, including p-values, wrong.
And they’re worse than low power because you can’t tell they’re wrong. If you lose half your sample and have no significant results, you notice. If the regression coefficients or standard errors aren’t what they’re supposed to be, there’s no way to tell.
That makes it worse in my book.
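Here is a quick simulation sketch of that invisible bias. The variable `income`, its distribution, and the missingness probabilities (60% missing above 50, 20% missing otherwise) are all hypothetical, picked only to create missingness that depends on the value itself. In a simulation we get to peek at the full data; in a real study you only ever see the complete cases.

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 10000

# Hypothetical variable with a true mean of about 50.
income = [rng.gauss(50, 10) for _ in range(n)]

# Missingness depends on the value itself: high values go missing more often.
# Listwise deletion keeps only the cases that were observed.
observed = [v for v in income if rng.random() > (0.6 if v > 50 else 0.2)]

full_mean = mean(income)
complete_case_mean = mean(observed)

print(f"cases kept: {len(observed)} of {n}")
print(f"full-data mean:     {full_mean:.2f}")
print(f"complete-case mean: {complete_case_mean:.2f}")
```

The drop in sample size is easy to spot. The downward shift in the mean is not, because nothing in the observed cases flags it as wrong.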
How did you do? (BTW, it took me years of seminars, reading, and trying things out to figure this all out.)