Multiple Imputation of Categorical Variables

Most multiple imputation methods assume multivariate normality, so a common question is how to impute missing values in categorical variables.

Paul Allison, one of my favorite authors of statistical information for researchers, did a study showing that the most common method actually gives worse results than listwise deletion. (Did I mention I’ve used it myself?)

What is the bad method?

1. Dummy code the variable

2. Impute a continuous value.  This will generally be between 0 and 1.

3. Round off to either 0 or 1, based on whether the imputed value is below or above .5.

As Allison discovered, this method generally leads to biased results and incorrect standard errors.
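To see why the rounding step can hurt, here is a minimal sketch in Python. It is not Allison's simulation. It assumes a single rare binary variable (true proportion .05), half the values missing completely at random, and a deliberately simplified imputation model that draws from a normal distribution with the observed mean and standard deviation. All names and settings are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a rare 0/1 variable with about half its values missing at random.
n, true_p, m = 100_000, 0.05, 5
x = rng.binomial(1, true_p, size=n).astype(float)
missing = rng.random(n) < 0.5
x_obs = x[~missing]


def impute_once(rng):
    """Steps 1-2: fill missing entries with continuous normal draws."""
    filled = x.copy()
    filled[missing] = rng.normal(
        x_obs.mean(), x_obs.std(ddof=1), size=int(missing.sum())
    )
    return filled


unrounded, rounded = [], []
for _ in range(m):
    filled = impute_once(rng)
    # Leave the imputed values continuous (no step 3).
    unrounded.append(filled.mean())
    # Step 3: round imputed values to 0 or 1 at the .5 cutoff.
    rounded.append(np.where(filled >= 0.5, 1.0, 0.0).mean())

# Pool the point estimates by averaging across the m imputed data sets.
unrounded_est = float(np.mean(unrounded))
rounded_est = float(np.mean(rounded))

print(f"true proportion:    {true_p:.3f}")
print(f"without rounding:   {unrounded_est:.3f}")
print(f"with rounding:      {rounded_est:.3f}")
```

In this setup few of the normal draws land above .5, so rounding turns almost all imputed values into 0 and pulls the estimated proportion below the true value, while the unrounded version stays close to it. A real imputation model would also include predictors and draw its parameters for each imputation.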

What to do instead?

Allison compared this approach to four others, each of which generally gave more accurate results, at least under some conditions.

1. Listwise deletion

2. Imputation of the continuous variable without rounding (just leave off step 3).

3. Logistic Regression imputation

4. Discriminant Analysis imputation

These last two generally performed best, but only work in limited situations.
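For a sense of what logistic regression imputation looks like, here is a rough sketch in Python using only NumPy. It is a generic illustration, not the procedure from Allison's study. It assumes a binary variable y with one fully observed predictor z, a hand-written Newton-Raphson fit, and coefficients drawn from a normal approximation for each imputation. The data and every name in it are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: binary y depends on one fully observed predictor z.
n = 5_000
z = rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + 1.5 * z)))).astype(float)
missing = rng.random(n) < 0.3  # about 30% of y treated as missing

X = np.column_stack([np.ones(n), z])  # intercept plus predictor
X_obs, y_obs = X[~missing], y[~missing]


def fit_logistic(X, y, iters=25):
    """Newton-Raphson logistic fit; returns coefficients and their covariance."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ beta))
        w = p * (1 - p)
        hessian = X.T @ (X * w[:, None])
        beta = beta + np.linalg.solve(hessian, X.T @ (y - p))
    p = 1 / (1 + np.exp(-X @ beta))
    w = p * (1 - p)
    cov = np.linalg.inv(X.T @ (X * w[:, None]))
    return beta, cov


beta_hat, cov = fit_logistic(X_obs, y_obs)

m = 5
proportions = []
for _ in range(m):
    # Draw coefficients so each imputation reflects uncertainty in the model.
    beta_draw = rng.multivariate_normal(beta_hat, cov)
    p_miss = 1 / (1 + np.exp(-X[missing] @ beta_draw))
    y_imp = y.copy()
    # Impute actual 0/1 values from the predicted probabilities: no rounding needed.
    y_imp[missing] = rng.binomial(1, p_miss)
    proportions.append(y_imp.mean())

pooled_proportion = float(np.mean(proportions))
print(f"pooled proportion of 1s: {pooled_proportion:.3f}")
```

Because the imputed values are drawn as 0s and 1s in the first place, there is nothing to round, and the proportion in each imputed data set is simply the mean of the variable.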

Access the full article here.

 



Comments

  1. Charlotte says

    Hi Karen,
    I’m currently trying to use MI for a categorical variable (i.e. whether patients have been readmitted within 6 months of discharge) for which I have used dummy coding (1=readmitted, 2=not readmitted). I’ve managed to carry out the imputation; however, I now want to obtain the proportions of those readmitted and not readmitted. Can you advise me on how to do this without rounding the dummy variable, please?
    Many Thanks,
    Charlotte

