The Analysis Factor

Statistical Consulting, Resources, and Statistics Workshops for Researchers

Outliers: To Drop or Not to Drop

by Karen Grace-Martin

Outliers are one of those statistical issues that everyone knows about, but most people aren’t sure how to deal with.  Most parametric statistics, like means, standard deviations, and correlations, and every statistic based on these, are highly sensitive to outliers.  And since the assumptions of common statistical procedures, like linear regression and ANOVA, are also based on these statistics, outliers can really mess up your analysis.
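To make that sensitivity concrete, here is a minimal sketch using made-up numbers (the data are hypothetical, not from any real study). It compares the mean and standard deviation with and without one extreme value. The median is included only as a contrast, since it is a statistic that barely moves.

```python
import statistics

# Hypothetical sample: eight ordinary measurements plus one extreme value.
clean = [10, 12, 11, 13, 12, 11, 10, 13]
with_outlier = clean + [60]


def summarize(values):
    """Return the mean, sample standard deviation, and median of values."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
    }


summary_clean = summarize(clean)
summary_outlier = summarize(with_outlier)

# Print both summaries side by side for comparison.
for label, summary in [("without outlier", summary_clean),
                       ("with outlier", summary_outlier)]:
    print(label, {name: round(value, 2) for name, value in summary.items()})
```

In this toy sample the single extreme value pulls the mean up by several units and inflates the standard deviation many times over, while the median shifts only slightly.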

Despite all this, as much as you’d like to, it is NOT acceptable to drop an observation just because it is an outlier.  Outliers can be legitimate observations and are sometimes the most interesting ones.  It’s important to investigate the nature of the outlier before deciding what to do with it.

  1. If it is obvious that the outlier is due to incorrectly entered or measured data, you should drop the outlier:

    For example, I once analyzed a data set in which a woman’s weight was recorded as 19 lbs.  I knew that was physically impossible.  Her true weight was probably 91, 119, or 190 lbs, but since I didn’t know which one, I dropped the outlier.

    This also applies to a situation in which you know the datum did not accurately measure what you intended.  For example, if you are testing people’s reaction times to an event, but you see that a participant is not paying attention and is randomly hitting the response key, you know it is not an accurate measurement.

  2. If the outlier does not change the results but does affect assumptions, you may drop the outlier, but note that you did so in a footnote of your paper.

    Neither the presence nor absence of the outlier in the graph below would change the regression line:

    graph-1

  3. More commonly, the outlier affects both results and assumptions.  In this situation, it is not legitimate to simply drop the outlier.  You may run the analysis both with and without it, but you should state, at least in a footnote, which data points you dropped and how the results changed.

    graph-2

  4. If the outlier creates a significant association, you should drop the outlier and should not report any significance from your analysis.

    In the following graph, the relationship between X and Y is clearly created by the outlier.  Without it, there is no relationship between X and Y, so the regression coefficient does not truly describe the effect of X on Y.

    graph-3
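Running the analysis with and without the suspect point is easy to script. The sketch below uses hypothetical data that mimic situation 4: six points with essentially no trend, plus one far-away point. The helper `fit_line` is an illustrative name of my own, written in plain Python rather than taken from any particular statistics package.

```python
import math


def fit_line(x, y):
    """Ordinary least squares with one predictor: returns (slope, intercept, r)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Hypothetical data: the last point sits far from the rest on both X and Y.
x = [1, 2, 3, 4, 5, 6, 20]
y = [3, 5, 2, 5, 3, 4, 20]
outlier_index = 6

# Build a copy of the data with the suspect point left out.
x_without = x[:outlier_index] + x[outlier_index + 1:]
y_without = y[:outlier_index] + y[outlier_index + 1:]

slope_with, intercept_with, r_with = fit_line(x, y)
slope_without, intercept_without, r_without = fit_line(x_without, y_without)

print(f"with outlier:    slope={slope_with:.3f}, r={r_with:.3f}")
print(f"without outlier: slope={slope_without:.3f}, r={r_without:.3f}")
```

Here the strong correlation exists only because of the one far-away point. If the two fits had agreed closely instead, you would be in situation 2, and if they differed but both showed a relationship, you would be in situation 3.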

So in those cases where you shouldn’t drop the outlier, what do you do?

One option is to try a transformation.  Square root and log transformations both pull in high numbers.  This can make assumptions work better if the outlier is on the dependent variable and can reduce the impact of a single point if the outlier is on an independent variable.
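As a rough illustration of how much these transformations pull in a high value, the sketch below uses hypothetical numbers. The `relative_gap` measure is an ad hoc one I made up for this example, not a standard statistic: it compares the distance from the largest value to its nearest neighbor against the spread of the remaining values.

```python
import math

# Hypothetical right-skewed values with one very large observation.
values = [1, 4, 9, 16, 10000]


def relative_gap(data):
    """Gap between the two largest values, relative to the spread of the rest."""
    ordered = sorted(data)
    rest_spread = ordered[-2] - ordered[0]
    return (ordered[-1] - ordered[-2]) / rest_spread


sqrt_values = [math.sqrt(v) for v in values]  # needs values >= 0
log_values = [math.log(v) for v in values]    # needs values > 0

gap_raw = relative_gap(values)
gap_sqrt = relative_gap(sqrt_values)
gap_log = relative_gap(log_values)

print(f"raw gap:  {gap_raw:.1f}")
print(f"sqrt gap: {gap_sqrt:.1f}")
print(f"log gap:  {gap_log:.1f}")
```

Both transformations keep the values in the same order, but the log pulls the extreme value in more strongly than the square root does. Note that the log requires strictly positive values.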

Another option is to try a different model.  This should be done with caution, but it may be that a non-linear model fits better.  For example, in situation 3 above, perhaps an exponential curve fits the data with the outlier intact.
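Here is one way that comparison might look. This is a sketch under stated assumptions: the data are generated, noise-free, from a hypothetical curve y = 2·exp(0.5x), and the exponential curve is fit by regressing log(y) on x, which minimizes error on the log scale rather than on the original scale. The helper names are illustrative.

```python
import math


def fit_line(x, y):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def sum_sq_resid(observed, predicted):
    """Sum of squared differences between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))


# Hypothetical data: on the raw scale, the last point looks like an outlier.
x = [0, 1, 2, 3, 4, 5, 6, 8]
y = [2 * math.exp(0.5 * xi) for xi in x]

# Model 1: a straight line fit to the raw data.
slope, intercept = fit_line(x, y)
linear_pred = [intercept + slope * xi for xi in x]

# Model 2: y = a * exp(b * x), fit as a straight line to log(y).
b, log_a = fit_line(x, [math.log(yi) for yi in y])
a = math.exp(log_a)
exp_pred = [a * math.exp(b * xi) for xi in x]

print(f"linear SSE:      {sum_sq_resid(y, linear_pred):.2f}")
print(f"exponential SSE: {sum_sq_resid(y, exp_pred):.2f}")
```

With real, noisy data the exponential fit will not be perfect, and whether a curve like this makes sense is a theoretical question as much as a statistical one.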

Whichever approach you take, you need to know your data and your research area well.  Try different approaches, and see which make theoretical sense.


Comments

  1. Kelly says

    May 11, 2018 at 9:03 am

    Thank you for this explanation, it is really helpful. Is there an academic article or book that I can refer to when using these guidelines in my thesis?

  2. Waseef says

    March 3, 2018 at 10:29 am

    Respected Karen! Can you please add or send me the reference of this justification. As I’m facing this issue in my research but I’ve no reference to mention at the end of Reference list. Advance thanks.

  3. Vivien says

    June 20, 2017 at 5:59 pm

    In plot number 2, I do not understand why you would want to drop the outlier.
    From my point of view, it tells you that the model is rather robust. Remember that a statistical model should only be applied for prediction within the data range used for its calibration. The larger the data range, the more robust it will be for predicting in new situations.

  4. Nate says

    May 9, 2017 at 12:38 pm

    When cleaning a large dataset for outliers, does a separate outlier analysis have to be run for every single regression analysis one plans on running? For instance, does running 30 different regressions typically require 30 separate outlier analyses? If so, do the outliers need to be added back into the data set before running the next outlier analysis? If multiple outlier analyses are not required in this case, is just one outlier analysis enough (i.e., entering all of the IVs that I plan to use into one step and regressing only one DV on them simultaneously)?

  5. Lasse says

    September 19, 2016 at 5:37 am

    After checking all of the above, I do not understand the rationale for keeping an outlier that affects both assumptions and conclusions just on principle. This seems counterintuitive, since most models will never fit the data perfectly and the point is usually to investigate overall trends in the sample population. Keeping the outlier amounts to defending the assumptions of parametric models and their inability to handle outliers that are given undue influence only because of how the model is estimated, not because they really define the group or variable. In a survival analysis, maybe somebody died in a car accident (but you don’t have the death certificate). Biomarkers can’t predict that, and neither can most genes. There is nothing really wrong with the outlier itself; the problem is the inability of most parametric tests to deal with one or two extreme observations. If robust estimators are not available, downweighting or dropping a case that changes the entire conclusion of the model (and reporting it) seems perfectly fair.

  6. Emil M Friedman says

    May 30, 2016 at 5:35 pm

    In example two, the outlier should have little effect on the slope estimate but it ought to have a BIG effect on the standard error of the slope estimate. It would definitely be worth investigating how it came about. A lot might depend on the physical situation involved, whether we are dealing with correlation or with truly independent and dependent variables, etc.

  7. Eric Elegado says

    April 6, 2016 at 4:51 am

    Very helpful blog! Thank you. By the way, how do I put this page as a reference to my paper?

  8. Mohammed says

    February 17, 2016 at 4:59 am

    Hi, is there any specific method to identify outliers and reduce their impact in ANOVA on a nested design? I have a nested design in a Gage R&R study that contains 4 outliers in 3 operators. Thanks.

  9. DanBrew says

    April 24, 2015 at 8:57 am

    Use Spearman correlation instead of Pearson. It can handle outliers.

    • Tassadduq says

      May 4, 2017 at 2:03 am

      Can we remove outliers based on the CV?
      To lower the CV, change the replication data value, but without changing the mean value of the treatment.

  10. Simon says

    March 24, 2015 at 10:04 pm

    My rule of thumb is to always drop the outlier. That’s just me though.

    • Tassadduq says

      May 4, 2017 at 2:05 am

      Sir,

      If we drop the outlier, how can we get a value in its place to run the ANOVA?

  11. Muthoka says

    March 13, 2015 at 11:54 am

    I tried this in a study and the effects are not trivial. First, my data had some observations which clearly were quite far from the mean (sd of over 50000). I included them and my parameters were significant all through. Upon removing outliers, one of them was not significant and Adj R^2 fell by over 20%. While in my case of over 10000 observations it may be theoretically right to omit them, I don’t know what effect the same approach may have on narrow samples or specific studies. I am analysing household consumption expenditure, and conclusions based on outliers will most probably be unrepresentative. I tried the robust errors suggested here as well. I think that with outliers (whose effect is to inflate the variances and hence affect parameter significance), robust errors should be enough, as much as we trust the underlying framework.

  12. jordyn says

    December 13, 2014 at 4:09 pm

    I’m confused by number 4. What happens if you take out the outlier, and things become more significant?

    What would you do in this situation?
    I have multivariable logistic regression results. With the outlier in the model, p-values are as follows (age: 0.044, ethnicity: 0.054, knowledge composite variable: 0.059). When I take out the outlier, the values become (age: 0.424, ethnicity: 0.039, knowledge: 0.074).

    So by taking out the outlier, 2 variables become less significant while one becomes more significant.

  13. Psychometrician says

    December 9, 2014 at 6:54 am

    Should outliers or extremes be removed before developing a norm group for a certain test?

  14. MRS IBE IJEOMA says

    May 21, 2014 at 5:18 am

    I feel outliers, if not intentional, are often useful data with vital information that needs special attention, so they don’t need to be deleted from the analysis.

  15. John Newman says

    April 22, 2013 at 9:09 am

    I used a square root to transform the IV weight. This variable is now “more normal” but still tests as significantly non-normal. If I use this variable, the R2 of my model decreases. How should I deal with this? Should I or should I not use the transformed version of this variable?

    • Karen says

      April 29, 2013 at 6:47 pm

      Hi John, here’s the good news: IVs don’t have to be normally distributed. You may want to transform a skewed IV if the values in the long tail are exerting too much influence or to make the relationship with Y linear. But there are no model assumptions about the shape of the IV’s distribution.

  16. Oliver says

    July 10, 2012 at 7:23 pm

    Thanks for the article
    Technically, it’s not justifiable to drop an observation just because it’s an outlier, unless you are sure it’s an error, maybe committed during data entry or at some other stage. For regression analysis, I would advise using robust regression, which deals with this problem.

    • Karen says

      July 11, 2012 at 3:36 pm

      Hi Oliver,

      Agreed. Never drop just because it’s an outlier.

      And thanks for the suggestion–I agree robust regression often helps.

      Karen

      • Sahil says

        November 19, 2016 at 9:33 am

        What is robust regression ?

        • Larry says

          November 1, 2019 at 11:05 am

          I think robust statistics are the way to go. When you are deciding on whether to delete an outlier, it’s like deciding if its weight should be 0 or 1. But if it’s not due to an error, then you just want to assign it a lower weight between 0 and 1 so it does not have an undue influence. After all, it is just one point and it shouldn’t influence the results more than any other point.


Copyright © 2008–2021 The Analysis Factor, LLC. All rights reserved.