Multiple Imputation of Categorical Variables

by Karen


Most multiple imputation methods assume multivariate normality, so a common question is how to impute missing values in categorical variables.

Paul Allison, one of my favorite authors of statistical information for researchers, did a study showing that the most common method actually gives worse results than listwise deletion. (Did I mention I’ve used it myself?)

What is the bad method?

1. Dummy code the variable

2. Impute a continuous value.  This will generally be between 0 and 1.

3. Round off to either 0 or 1, based on whether the imputed value is below or above .5.
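To make the three steps concrete, here is a minimal Python sketch. It is only an illustration, not Allison's code: the data are made up, the names (`answers`, `dummy`, `continuous`, `rounded`) are hypothetical, and step 2 uses a single simple linear regression with random noise as a stand-in for a real multivariate normal imputation routine. Values exactly at .5 are rounded up, which is an arbitrary choice.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Made-up data: a continuous predictor x and a yes/no variable with gaps.
n = 200
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
answers = ["yes" if rng.random() < 1 / (1 + math.exp(-xi)) else "no" for xi in x]
answers = [None if rng.random() < 0.3 else a for a in answers]  # about 30% missing

# Step 1: dummy code the variable (yes -> 1, no -> 0, missing stays None).
dummy = [None if a is None else int(a == "yes") for a in answers]

# Step 2: impute a continuous value. Here: least-squares regression of the
# dummy on x using complete cases, plus normal noise (an assumed stand-in
# for a proper multivariate normal imputation model).
obs = [(xi, di) for xi, di in zip(x, dummy) if di is not None]
mean_x = statistics.fmean(xi for xi, _ in obs)
mean_d = statistics.fmean(di for _, di in obs)
sxx = sum((xi - mean_x) ** 2 for xi, _ in obs)
slope = sum((xi - mean_x) * (di - mean_d) for xi, di in obs) / sxx
intercept = mean_d - slope * mean_x
resid_ss = sum((di - (intercept + slope * xi)) ** 2 for xi, di in obs)
resid_sd = math.sqrt(resid_ss / (len(obs) - 2))

continuous = [
    float(di) if di is not None
    else intercept + slope * xi + rng.gauss(0.0, resid_sd)
    for xi, di in zip(x, dummy)
]

# Step 3: round the imputed values to 0 or 1 at the .5 cutoff.
rounded = [
    di if di is not None else (1 if ci >= 0.5 else 0)
    for ci, di in zip(continuous, dummy)
]
```

Observed values pass through untouched; only the missing entries get an imputed value. The unrounded version is kept in `continuous` so the two can be compared.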

As Allison discovered, this method generally leads to biased results and incorrect standard errors.

What to do instead?

Allison compared this approach to four others, each of which generally gave more accurate results, at least under some conditions.

1. Listwise deletion

2. Imputation of the continuous variable without rounding (just leave off step 3).

3. Logistic Regression imputation

4. Discriminant Analysis imputation

These last two generally performed best, but only work in limited situations.
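For a feel of what logistic regression imputation looks like, here is a rough Python sketch for a binary variable with one predictor. Everything in it is an assumption for illustration: the data are made up, `fit_logistic` and `logistic_impute` are hypothetical helper names, the model is fit by plain gradient ascent, and refitting on a bootstrap sample for each imputed data set is one simple way I've chosen to carry the uncertainty in the coefficients. It is not necessarily the procedure used in Allison's study or in any particular software package.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(xs, ds, steps=500, lr=0.5):
    """Fit P(d = 1 | x) = sigmoid(b0 + b1 * x) by gradient ascent."""
    b0 = b1 = 0.0
    n_obs = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for xi, di in zip(xs, ds):
            err = di - sigmoid(b0 + b1 * xi)
            g0 += err
            g1 += err * xi
        b0 += lr * g0 / n_obs
        b1 += lr * g1 / n_obs
    return b0, b1

def logistic_impute(x, d, rng, n_imputations=5):
    """Return several completed copies of d; None marks a missing value."""
    obs = [(xi, di) for xi, di in zip(x, d) if di is not None]
    completed = []
    for _ in range(n_imputations):
        # Refit on a bootstrap sample so each copy uses different coefficients.
        boot = [rng.choice(obs) for _ in obs]
        b0, b1 = fit_logistic([p[0] for p in boot], [p[1] for p in boot])
        # Draw each missing value as a 0/1 outcome from its predicted probability.
        completed.append([
            di if di is not None
            else (1 if rng.random() < sigmoid(b0 + b1 * xi) else 0)
            for xi, di in zip(x, d)
        ])
    return completed

# Made-up data: a binary variable related to x, with about 30% missing.
rng = random.Random(1)
n = 200
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
d = [1 if rng.random() < sigmoid(1.5 * xi) else 0 for xi in x]
d = [None if rng.random() < 0.3 else di for di in d]

imputed_sets = logistic_impute(x, d, rng)
```

Because the imputed values are drawn as 0s and 1s directly, there is nothing to round. Each list in `imputed_sets` would then be analyzed separately and the results combined, as in any multiple imputation.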

Access the full article here.
