The Analysis Factor

Statistical Consulting, Resources, and Statistics Workshops for Researchers


EM Imputation and Missing Data: Is Mean Imputation Really so Terrible?

by Karen Grace-Martin 31 Comments

I’m sure I don’t need to explain to you all the problems that occur as a result of missing data.  Anyone who has dealt with missing data—that means everyone who has ever worked with real data—knows about the loss of power and sample size, and the potential bias in your data that comes with listwise deletion.

Listwise deletion is the default method for dealing with missing data in most statistical software packages.  It simply means excluding from the analysis any cases with data missing on any variables involved in the analysis.
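As a quick illustration, here is a minimal sketch of listwise deletion using NumPy on a small made-up dataset (the numbers are hypothetical, purely for demonstration):

```python
# A minimal sketch of listwise deletion (hypothetical data: two variables,
# with one value missing in each of two different cases).
import numpy as np

data = np.array([
    [23.0, 48.0],
    [35.0, np.nan],   # missing on the second variable
    [np.nan, 61.0],   # missing on the first variable
    [41.0, 75.0],
    [52.0, 58.0],
])

# Drop any case (row) with a missing value on any analysis variable.
complete_cases = data[~np.isnan(data).any(axis=1)]
print(data.shape[0], complete_cases.shape[0])  # 5 cases shrink to 3
```

Two isolated missing values cost two entire cases, which is exactly the loss of power the post describes.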

A very simple, and in many ways appealing, method devised to overcome these problems is mean imputation. Once again, I’m sure you’ve heard of it–just plug in the mean for that variable for all the missing values. The nice part is the mean isn’t affected, and you don’t lose that case from the analysis.
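In a minimal NumPy sketch (again with hypothetical numbers), mean imputation looks like this: the observed mean is plugged into each missing slot, so the variable's mean is untouched and no cases are lost.

```python
# A minimal sketch of mean imputation on one variable (hypothetical values).
import numpy as np

x = np.array([2.0, 4.0, np.nan, 6.0, np.nan, 8.0])

mean_before = np.nanmean(x)                        # mean of observed values: 5.0
x_imputed = np.where(np.isnan(x), mean_before, x)  # fill each gap with that mean

print(mean_before, x_imputed.mean())  # both 5.0: the mean is unchanged
print(x_imputed.size)                 # all 6 cases retained
```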

And it’s so easy! SPSS even has a little button to click to just impute all those means.

But there are new problems.

While it’s true the mean doesn’t change, the relationships with other variables do. And that’s usually what you’re interested in, right? Well, now they’re biased.

And while the sample size remains at its full value, the standard error of that variable will be vastly underestimated–and this underestimation gets bigger the more missing data there are. Too-small standard errors lead to too-small p-values, so now you’re reporting results that should not be there.
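To see the shrinkage concretely, here is a small simulation (entirely made-up data, not from the post): delete 30% of a variable completely at random, mean-impute it, and compare the estimated standard errors.

```python
# Simulated demonstration of how mean imputation shrinks the standard error.
import numpy as np

rng = np.random.default_rng(42)
x_full = rng.normal(loc=50.0, scale=10.0, size=1000)

# Make ~30% of the values missing completely at random, then mean-impute.
missing = rng.random(x_full.size) < 0.30
x_imputed = x_full.copy()
x_imputed[missing] = x_full[~missing].mean()

se_full    = x_full.std(ddof=1) / np.sqrt(x_full.size)
se_imputed = x_imputed.std(ddof=1) / np.sqrt(x_imputed.size)
print(se_full, se_imputed)  # the mean-imputed standard error is noticeably smaller
```

With roughly 30% of the values pinned at the mean, the variance (and hence the standard error) shrinks by roughly 30%, which is exactly the too-small standard error, and the too-small p-values, the post warns about.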

There are other options. Multiple Imputation and Maximum Likelihood both solve these problems. But while Multiple Imputation is now available in most major stats packages, it is very labor-intensive to do well. And Maximum Likelihood isn’t hard or labor-intensive, but it usually requires structural equation modeling software, such as AMOS or MPlus.

The good news is there are other imputation techniques that are still quite simple, and don’t cause bias in some situations. And sometimes (although rarely) it really is okay to use mean imputation. When?

If your rate of missing data is very, very small, it honestly doesn’t matter what technique you use. I’m talking very, very, very small (2-3%).

There is another, better method for imputing single values, however, that is only slightly more difficult than mean imputation. It uses the E-M Algorithm, which stands for Expectation-Maximization. It is an iterative procedure that uses the other variables to impute a value (Expectation), then checks whether that is the most likely value (Maximization). If not, it re-imputes a more likely value. This continues until it converges on the most likely value.
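To give a flavor of the algorithm, here is a deliberately simplified, hypothetical sketch for a single incomplete variable y predicted from one complete variable x (all data simulated; a real EM implementation iterates on the full vector of means and the covariance matrix, not just one regression):

```python
# A simplified sketch of the E-M cycle for one incomplete variable.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)

missing = rng.random(200) < 0.25       # knock out ~25% of y at random
y_obs = np.where(missing, np.nan, y)

# Start from mean imputation, then alternate the two steps until stable.
y_work = np.where(missing, np.nanmean(y_obs), y_obs)
for _ in range(100):
    slope, intercept = np.polyfit(x, y_work, 1)  # M-step: re-estimate parameters
    expected = intercept + slope * x[missing]    # E-step: most likely imputed values
    if np.max(np.abs(expected - y_work[missing])) < 1e-10:
        break
    y_work[missing] = expected

print(round(slope, 2))  # converges near the true slope of 2
```

Each pass refits the regression, replaces the imputed values with their expected values under that fit, and stops once the imputations stop changing; note how the other variable, not a bare mean, drives the imputed values.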

EM imputations are better than mean imputations because they preserve the relationships with other variables, which is vital if you go on to use something like Factor Analysis or Linear Regression.

EM imputations still underestimate standard errors, however. Once again, this approach is only reasonable if the standard errors of individual items are not vital, as in Factor Analysis, which doesn’t involve p-values.

The heavy hitters like Multiple Imputation and Maximum Likelihood are still superior methods of dealing with missing data and are in most situations the only viable approach. But you need to fit the right tool to the size of the problem.

It may be true that backhoes are better at digging holes than trowels, but trowels are just right for digging small holes. It’s better to use a small tool like EM when it fits than to ignore the problem altogether.

EM Imputation is available in SAS, Stata, R, and the SPSS Missing Values Analysis module.

Approaches to Missing Data: the Good, the Bad, and the Unthinkable
Learn the different methods for dealing with missing data and how they work in different missing data situations.

Tagged With: EM algorithm, listwise deletion, maximum likelihood, mean imputation, Missing Data, Multiple Imputation, SPSS Missing Values Analysis

Related Posts

  • Two Recommended Solutions for Missing Data: Multiple Imputation and Maximum Likelihood
  • Multiple Imputation in a Nutshell
  • Quiz Yourself about Missing Data
  • Answers to the Missing Data Quiz


Comments

  1. Gianfranco says

    November 30, 2018 at 8:11 am

    Thank you for the post.
In a dataset with both continuous and categorical missing values I can’t use EM.
Is this a good strategy:
1) logistic imputation of the categorical variables;
2) Expectation-Maximization for the continuous ones;
3) analysis (i.e. linear regression)?

    Thanks a lot!

    • Karen Grace-Martin says

      November 30, 2018 at 11:54 am

      Hi Gianfranco,

      I can’t advise a strategy without digging in to all the details, but I can say that you cannot mix multiple imputation and EM. And generally speaking all imputation should be multiple.

  2. Charlotte Paterson says

    July 24, 2017 at 11:30 am

    Hi Karen,
    Is it appropriate to use multiple imputation for entire outcomes (i.e. entire questionnaires). I’m only trying to produce descriptive stats for a feasibility trial so I have produced completer descriptive analyses (listwise deletion), however there is a large portion of participants with missing follow-up questionnaires. Therefore I wanted to use MI to impute these missing outcomes and compare the descriptive stats (and effect sizes) produced from an imputed data set to a unimputed data set. What do you think?

    • Karen Grace-Martin says

      September 11, 2018 at 12:54 pm

      HI Charlotte,

      Yes, it is. Read the John Graham article linked below–he talks about that exact situation.

      But you might actually be better off with a mixed model since there is repeated measures. It will save a lot of time and be just as good. See: https://www.theanalysisfactor.com/missing-data-two-recommended-solutions/

  3. mehdi says

    June 6, 2017 at 7:40 am

Hi Karen,
Do you have R code for the EM algorithm? I just want to impute missing data with EM.

    • Karen Grace-Martin says

      September 11, 2018 at 12:55 pm

      Hi Mehdi,

      Not easily accessible. I am pretty sure the Amelia package has it, if I’m remembering correctly.

  4. Hassan Qutub says

    February 7, 2016 at 2:30 pm

    Hi Karen, thanks for the valuable information about missing data. You mentioned
    “If your rate of missing data is very, very small, it honestly doesn’t matter what technique you use. I’m talking very, very, very small (2-3%)”
Do you have a reference for that? It would really be useful.

    Regards

    Hassan

    • Karen says

      February 22, 2016 at 2:30 pm

      Ooh, I did once. Perhaps it’s in John Graham’s very good article: http://www.stats.ox.ac.uk/~snijders/Graham2009.pdf

      • Tara says

        July 4, 2016 at 1:50 pm

        Thanks for providing the reference link!

      • Barbara B. says

        May 22, 2018 at 11:18 pm

        Hi Karen,
unfortunately it is not in Graham (2009). Do you have any other suggestion for a reference for “very, very, very small (2-3%)”?
That would be really handy. Thanks in advance.
        Barbara

        • Karen Grace-Martin says

          October 26, 2018 at 5:22 pm

          Hi Barbara,

          Oh my, I just saw the reference last week, but I’m afraid I don’t remember.

  5. Sushant says

    December 11, 2015 at 6:25 pm

    Hi,
I want some datasets with missing data (I can’t just remove data myself; it has to be missing at random). Can you suggest some, or suggest a program or software to remove data randomly?
I also need some datasets that form non-convex clusters. I need these for my experiments on EM and other algorithms.
    Thanks

  6. Bianca says

    February 25, 2015 at 9:50 am

    Hey Karen,
do you have a reliable reference for that 5% limit of missing values for using EM? You would help me a lot! 🙂

    • Madlaina says

      June 15, 2015 at 8:55 am

      Try: Annual Review of Psychology (Graham, 2009)

  7. Lorelai says

    January 20, 2015 at 11:07 am

Hello, I don’t think you’re quite right about EM underestimating parameters. In this process, the variance and covariance of that variable are also corrected, as explained in “The SAGE Handbook of Social Science Methodology” by William Outhwaite and Stephen Turner.

    • Karen Grace-Martin says

      September 11, 2018 at 12:59 pm

      Hi Lorelai,

      It’s not that it underestimates the parameter values themselves, but the standard errors of actual model effects.

      So yes, if you’re just estimating means and correlations, you’re fine. But if you want to impute data points and use those in a model, your software doesn’t know that those are estimates and not real data points. That’s where any parameters estimated from those imputed data have too-small standard errors. There is more uncertainty than the model is accounting for. That’s why we need multiple imputation if we’re going to use p-values.

  8. pei ting says

    October 29, 2014 at 9:58 am

    Hi Karen,

    Is it correct to say that once i clicked on “impute missing data” for a specific variable, that variable will have no missing data in the imputation dataset?

    I would like to check what went wrong with my procedure.

I clicked on Multiple Imputation –> Impute Missing Data Values in SPSS. All the tabs were left at their defaults.

After I clicked “OK”, I noticed that missing data were still present in the datasets imputation_1 through imputation_5.

    Greatly appreciate if you could guide me through.

    Thanks very much!

    • Karen says

      November 3, 2014 at 4:45 pm

      Hi Pei,

      Hmm, that is indeed what should happen. I would have to troubleshoot it to figure out what is going wrong. It could be some default in your version of SPSS. I can’t think of one off the top of my head, though that’s often the cause.

  9. Sebastian says

    January 15, 2014 at 10:07 am

    Hi Karen
Would you tend to use “as many variables as possible” as predictors in EM imputation, or only construct-relevant ones? Why?
E.g., use socioeconomic status, IQ, and so forth as predictors for reading proficiency?

    Thanks!

    • Karen says

      January 15, 2014 at 10:31 am

      Hi Sebastian,

As a general rule, you want to use as many predictors as are helpful for prediction. So there may be a predictor that isn’t theoretically important but is helpful for prediction (for whatever unknown reason). But you don’t want to throw in everything you have. Adding unhelpful predictors just raises standard errors.

  10. Kenny says

    June 30, 2013 at 10:56 pm

    What is the R function for the EM imputation?

    Thanks

    • Karen says

      July 1, 2013 at 3:45 pm

      Kenny, I don’t use R (maybe an R user can jump in here), but I believe MICE can do it. I am pretty sure it does multiple imputation, and EM is generally one way of doing MI.

    • Lorelai says

      January 20, 2015 at 11:09 am

      Just explore the “MI” package on R’s website

      • Lorelai says

        January 20, 2015 at 11:16 am

        Or try the function “impute.mdr” from imputeMDR package

  11. Kirstine says

    April 23, 2013 at 7:28 am

    Hi Karen,

    Just wondering what you would recommend to do with imputed EM values for ordinal scales. Do you leave the imputed values (with decimal places) or do you recode so that values lie within the original values (from 1.001 to 1.499 = 1 for example). The imputed values are needed for a CFA and multiple regression.

    Thanks for your time,

    Kirstine.

    • Marsha says

      December 11, 2013 at 3:34 am

      Hi Karen,
      Not sure if you responded to Kirstine but I had the same question on imputed EM values for the ordinal scale..
      Thanks in advance.
      Marsha

      • Karen says

        December 23, 2013 at 1:39 pm

        Hi Marsha,

        As a general rule, you don’t want to round off any imputations. Even if the imputed values look weird, you need to have variation in there, so don’t round them off.

        • Mirjana says

          June 1, 2017 at 10:22 am

What should be done when imputed EM values are zero or negative, or exceed the maximum (e.g., -4, 0, and 8 when the Likert scale runs from 1 to 7)?

          • Mirjana says

            June 1, 2017 at 10:27 am

Continuing: in SPSS it is impossible to set constraints on the maximum and minimum values for EM, so how should this be solved? And also, since EM does not impute values for categorical variables, such as gender, what should be done with them?

      • Henrique says

        April 2, 2020 at 11:32 am

        Hello!
        I have the same doubt as Kristine and Marsha.
        How to input missing values for ordinal variables?
        EM function in SPSS is only available for continuous variables!

        • Karen Grace-Martin says

          April 17, 2020 at 2:45 pm

          You have to treat ordinal variables as categorical. No other options.



Copyright © 2008–2021 The Analysis Factor, LLC. All rights reserved.
