Multiple Imputation of Categorical Variables

by Karen


Most multiple imputation methods assume multivariate normality, so a common question is how to impute missing values for categorical variables.

Paul Allison, one of my favorite authors of statistical information for researchers, did a study showing that the most common method actually gives worse results than listwise deletion.  (Did I mention I’ve used it myself?)

What is the bad method?

1. Dummy code the variable.

2. Impute a continuous value.  This will generally be between 0 and 1.

3. Round off to either 0 or 1, based on whether the imputed value is below or above .5.

As Allison discovered, this method generally leads to biased results, and incorrect standard errors.
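To see why rounding can hurt, here is a minimal simulation sketch of the three steps above. It is not Allison's study design. The settings are all assumptions chosen for illustration: a dummy variable with a true proportion of .05, half the values missing completely at random, and a single imputation from a normal model with no predictors and no parameter uncertainty. The function name compare_rounding is hypothetical.

```python
import random
import statistics


def compare_rounding(true_p=0.05, n=20000, miss_rate=0.5, seed=1):
    """Return (unrounded, rounded) estimates of the proportion of 1s."""
    rng = random.Random(seed)

    # Simulate a 0/1 dummy variable, then delete values completely at random.
    full = [1 if rng.random() < true_p else 0 for _ in range(n)]
    observed = [x for x in full if rng.random() >= miss_rate]
    n_missing = n - len(observed)

    # Step 2: impute continuous values from a normal model fitted to the
    # observed 0/1 values (assumption: no predictors, parameters treated as known).
    mu = statistics.fmean(observed)
    sigma = statistics.pstdev(observed)
    draws = [rng.gauss(mu, sigma) for _ in range(n_missing)]

    # Step 3: round each imputed value to 0 or 1 at the .5 cutoff.
    rounded = [1 if d >= 0.5 else 0 for d in draws]

    # Estimate the proportion from observed plus imputed values, both ways.
    unrounded_est = statistics.fmean(observed + draws)
    rounded_est = statistics.fmean(observed + rounded)
    return unrounded_est, rounded_est


unrounded_est, rounded_est = compare_rounding()
print(f"unrounded: {unrounded_est:.3f}  rounded: {rounded_est:.3f}")
```

Under these assumed settings the unrounded estimate should land near .05, while the rounded one should come out noticeably lower, because few normal draws centered at .05 ever cross the .5 cutoff.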

What to do instead?

Allison compared this approach to four others, each of which generally gave more accurate results, at least under some conditions.

1. Listwise deletion

2. Imputation of continuous values for the dummy variable without rounding (just leave off step 3).

3. Logistic Regression imputation

4. Discriminant Analysis imputation

These last two generally performed best, but only work in limited situations.
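As a rough illustration of logistic regression imputation, the sketch below fits a logistic model on the complete cases and draws each missing 0/1 value from its predicted probability, so there is nothing to round. Everything here is an assumption for illustration: the single predictor, the simulated data, and the function names (fit_logistic, impute_once, pooled_proportion). A proper multiple imputation routine would also draw the regression coefficients from their posterior distribution before each imputation, which this sketch skips.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, iters=25):
    """Fit P(y=1|x) = sigmoid(b0 + b1*x) by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient of the log-likelihood
            g1 += (y - p) * x
            h00 += w               # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system by hand.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def impute_once(xs, ys, rng):
    """Return a copy of ys with each None replaced by a random 0/1 draw."""
    obs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    b0, b1 = fit_logistic([x for x, _ in obs], [y for _, y in obs])
    completed = []
    for x, y in zip(xs, ys):
        if y is None:
            y = 1 if rng.random() < sigmoid(b0 + b1 * x) else 0
        completed.append(y)
    return completed


def pooled_proportion(xs, ys, m=5, seed=2):
    """Average the proportion of 1s across m imputed data sets."""
    rng = random.Random(seed)
    props = []
    for _ in range(m):
        completed = impute_once(xs, ys, rng)
        props.append(sum(completed) / len(completed))
    return sum(props) / m


# Hypothetical data: one predictor, 30% of the outcome missing at random.
rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(4000)]
full = [1 if rng.random() < sigmoid(-1.0 + 1.5 * x) else 0 for x in xs]
ys = [None if rng.random() < 0.3 else y for y in full]
print(f"pooled proportion: {pooled_proportion(xs, ys):.3f}")
```

Because every imputed value is already 0 or 1, a proportion can be computed in each imputed data set and then averaged across them.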

Access the full article here.



Hi Karen,
I’m currently trying to use MI for a categorical variable (i.e. whether patients have been readmitted within 6 months of discharge) for which I have used dummy coding (1=readmitted, 2=not readmitted). I’ve managed to carry out the imputation; however, I now want to obtain the proportions of those readmitted and not readmitted. Can you advise me on how to do this without rounding the dummy variable, please?
Many Thanks,

