Multiple Imputation of Categorical Variables

Most multiple imputation methods assume multivariate normality, so a common question is how to impute missing values for categorical variables.

Paul Allison, one of my favorite authors of statistical information for researchers, did a study showing that the most common method actually gives worse results than listwise deletion. (Did I mention I’ve used it myself?)

What is the bad method?

1. Dummy code the variable

2. Impute a continuous value. This will generally be between 0 and 1.

3. Round off to either 0 or 1, based on whether the imputed value is below or above .5.

As Allison discovered, this method generally leads to biased results and incorrect standard errors.
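
To make the three steps concrete, here is a rough sketch in Python. It is a toy, not Allison's code: it assumes a single predictor x, a straight-line regression fit to the complete cases, and one random draw per missing value. The function name impute_linear_then_round and the simulated data are made up for illustration. Real multiple imputation software would also draw the regression parameters and repeat the whole process several times.

```python
import random


def impute_linear_then_round(y, x, rng, do_round=True):
    """Fill in missing entries (None) of a 0/1 dummy variable y using x.

    Step 1 (dummy coding) is assumed done: y already holds 0, 1, or None.
    """
    # Fit an ordinary least-squares line on the complete cases only.
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(obs)
    mean_x = sum(xi for xi, _ in obs) / n
    mean_y = sum(yi for _, yi in obs) / n
    sxx = sum((xi - mean_x) ** 2 for xi, _ in obs)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in obs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    resid_sd = (
        sum((yi - intercept - slope * xi) ** 2 for xi, yi in obs) / (n - 2)
    ) ** 0.5

    completed = []
    for xi, yi in zip(x, y):
        if yi is not None:
            completed.append(yi)  # observed values are left alone
            continue
        # Step 2: impute a continuous value (prediction plus random noise).
        value = intercept + slope * xi + rng.gauss(0.0, resid_sd)
        # Step 3: round to 0 or 1 at .5 (a value of exactly .5 goes to 1 here).
        if do_round:
            value = 1 if value >= 0.5 else 0
        completed.append(value)
    return completed


# Simulated example: a binary outcome related to x, with about 30% missing.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y_full = [1 if xi + rng.gauss(0, 1) > 0 else 0 for xi in x]
y = [None if rng.random() < 0.3 else yi for yi in y_full]

rounded = impute_linear_then_round(y, x, rng, do_round=True)
unrounded = impute_linear_then_round(y, x, rng, do_round=False)
```

Calling it with do_round=False simply skips step 3 and leaves the imputed values as they come out, including any that fall below 0 or above 1.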

What to do instead?

Allison compared this approach to four others, each of which generally gave more accurate results, at least under some conditions.

1. Listwise deletion

2. Imputation of a continuous value without rounding (just leave off step 3).

3. Logistic Regression imputation

4. Discriminant Analysis imputation

These last two generally performed best, but only work in limited situations.
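
For a sense of how logistic regression imputation differs, here is a minimal sketch in Python, again with a single predictor and simulated data. It is an illustration under simplifying assumptions, not the procedure from Allison's study: the helper names (fit_logistic, impute_logistic) are hypothetical, and the sketch reuses the same fitted coefficients for every imputation, whereas proper multiple imputation would also draw the coefficients to reflect their uncertainty.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(x, y, iters=25):
    """Fit P(y = 1) = sigmoid(b0 + b1 * x) by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p            # gradient of the log-likelihood
            g1 += (yi - p) * xi
            h00 += w                # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det < 1e-12:
            break  # e.g. perfectly separated data; stop rather than blow up
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def impute_logistic(y, x, rng):
    """Replace each missing y (None) with a 0/1 draw from its predicted probability."""
    obs_x = [xi for xi, yi in zip(x, y) if yi is not None]
    obs_y = [yi for yi in y if yi is not None]
    b0, b1 = fit_logistic(obs_x, obs_y)
    return [
        yi if yi is not None else int(rng.random() < sigmoid(b0 + b1 * xi))
        for xi, yi in zip(x, y)
    ]


# Simulated example: a binary outcome related to x, with about 30% missing.
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(200)]
y_full = [1 if xi + rng.gauss(0, 1) > 0 else 0 for xi in x]
y = [None if rng.random() < 0.3 else yi for yi in y_full]

# Create m completed datasets and average the proportion of 1s across them.
m = 5
datasets = [impute_logistic(y, x, rng) for _ in range(m)]
proportions = [sum(d) / len(d) for d in datasets]
pooled_proportion = sum(proportions) / m
```

Because every imputed value is drawn as a 0 or a 1, there is nothing to round, and a proportion can be computed in each completed dataset and then averaged across the imputations.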

Access the full article here.


Comments

Charlotte says

September 12, 2017 at 12:07 pm

Hi Karen,
I’m currently trying to use MI for a categorical variable (i.e. whether patients have been readmitted within 6 months of discharge) for which I have used dummy coding (1=readmitted, 2=not readmitted). I’ve managed to carry out the imputation; however, I now want to obtain the proportions of those readmitted and not readmitted. Can you advise me on how to do this without rounding the dummy variable, please?
Many Thanks,
Charlotte
