dropping outliers

Three Rules of Statistical Analysis from Your Statistics Class to Unlearn

April 28th, 2020 by

There are important ‘rules’ of statistical analysis. Like

But there are others you may have learned in statistics classes that don’t serve you or your analysis well once you’re working with real data.

When you are taking statistics classes, there is a lot going on. You’re learning concepts, vocabulary, and some really crazy notation. And probably a software package on top of that.

In other words, you’re learning a lot of hard stuff all at once.

Good statistics professors and textbook authors know that learning comes in stages. Trying to teach the nuances of good applied statistical analysis to students who are struggling to understand basic concepts results in no learning at all.

And yet students need to practice what they’re learning so it sticks. So professors teach you simple rules of application. Those simple rules work just fine for students in a stats class working on sparkling clean textbook data.

But they are over-simplified for you, the data analyst, working with real, messy data. 

Here are three rules of data analysis practice that you may have learned in classes that you need to unlearn.  They are not always wrong. They simply don’t allow for the nuance involved in real statistical analysis.

The Rules of Statistical Analysis to Unlearn:

1. To check statistical assumptions, run a test. Decide whether the assumption is met by the significance of that test. 

Every statistical test and model has assumptions. They’re very important. And they’re not always easy to verify.

For many assumptions, there are tests whose sole job is to test whether the assumption of another test is being met. Examples include Levene’s test for constant variance and the Kolmogorov-Smirnov test, often used for normality. These tests are tools to help you decide if your model assumptions are being met.

But they’re not definitive.

When you’re checking assumptions, there are a lot of contextual issues you need to consider: the sample size, the robustness of the test you’re running, the consequences of not meeting assumptions, and more.

What to do instead:

Use these test results as one of many pieces of information that you’ll use together to decide whether an assumption is violated.
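Here is a rough sketch of why sample size matters so much when you read these tests. It does not use Levene’s or Kolmogorov-Smirnov; as a simpler stand-in, it runs a z-test on skewness using the large-sample approximation that the standard error of skewness is about sqrt(6/n). The skewness value of 0.2 and both sample sizes are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def skewness_z_test(sample_skewness, n):
    """Two-sided z-test of H0: skewness = 0, using the approximate SE sqrt(6/n)."""
    z = sample_skewness / sqrt(6 / n)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# The same mild skewness (0.2, a made-up value) at two hypothetical sample sizes.
for n in (50, 5000):
    z, p = skewness_z_test(0.2, n)
    print(f"n = {n:>4}: z = {z:.2f}, p = {p:.4g}")
```

The departure from normality is identical in both runs. Only the sample size changes, and with it the verdict of the test, which is why the p-value alone can’t tell you whether the violation matters.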

2. Delete outliers that are 3 or more standard deviations from the mean.

This is an egregious one. Really. It’s bad.

Yes, it makes the data look pretty. Yes, there are some situations in which it’s appropriate to delete outliers (like when you have evidence that it’s an error). And yes, outliers can wreak havoc on your parameter estimates.

But don’t make it a habit. Don’t follow a rule blindly.

Deleting outliers because they’re outliers (or using techniques like Winsorizing) is a great way to introduce bias into your results or to miss the most interesting part of your data set.

What to do instead:

When you find an outlier, investigate it. Try to figure out if it’s an error. See if you can figure out where it came from.
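One way to put this into practice is to use the 3 standard deviation rule as a flag instead of a filter. The sketch below is only an illustration: the function name, the cutoff default, and the body weights (including an implausible 19 lbs) are all made up. It reports the suspicious rows and leaves the data untouched.

```python
from statistics import mean, stdev


def flag_for_review(values, cutoff=3.0):
    """Return (index, value, z) for points at least `cutoff` SDs from the mean.

    Nothing is deleted; the flagged rows are meant to be checked by a person.
    """
    m, s = mean(values), stdev(values)
    return [(i, v, (v - m) / s) for i, v in enumerate(values) if abs(v - m) / s >= cutoff]


# Hypothetical body weights in lbs; the 19 is an invented, implausible entry.
weights = [150, 162, 141, 155, 170, 148, 159, 19, 166, 152, 145, 158, 163, 149, 157]

for index, value, z in flag_for_review(weights):
    print(f"row {index}: value {value} (z = {z:.1f}) -> check the source record")
```

What you do with a flagged row depends on what the investigation turns up, not on the z-score.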

3. Check Normality of Dependent Variables before running a linear model

In a t-test, yes, there is an assumption that Y, the dependent variable, is normally distributed within each group. In other words, given the group as defined by X, Y follows a normal distribution.

ANOVA has a similar assumption: given the group as defined by X, Y follows a normal distribution.

In linear regression (and ANCOVA), where we have continuous variables, this same assumption holds. But it’s a little more nuanced since X is not necessarily categorical. At any specific value of X, Y has a normal distribution. (And yes, this is equivalent to saying the errors have a normal distribution).

But here’s the thing: the distribution of Y as a whole doesn’t have to be normal.

In fact, if X has a big effect, the distribution of Y, across all values of X, will often be skewed or bimodal or just a big old mess. This happens even if the distribution of Y, at each value of X, is perfectly normal.

What to do instead:

Because normality depends on which Xs are in a model, check assumptions after you’ve chosen predictors.
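A small simulation makes the point concrete. Everything here is an assumption chosen for illustration: two groups of 2,000 cases, a group effect of 10, and errors that are exactly normal with standard deviation 1. Excess kurtosis is used as a crude one-number summary of shape (about 0 for normal data); in a real analysis you would look at a Q-Q plot of the residuals as well.

```python
import random
from statistics import mean


def excess_kurtosis(values):
    """Sample excess kurtosis: about 0 for normal data."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3


random.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical two-group design: X shifts the mean of Y by 10, errors are N(0, 1).
group_means = {"control": 0.0, "treatment": 10.0}
y_by_group = {g: [random.gauss(mu, 1.0) for _ in range(2000)] for g, mu in group_means.items()}

# Y pooled across X: two separate humps, nowhere near normal.
pooled_y = [y for ys in y_by_group.values() for y in ys]

# Residuals after fitting the group means (the t-test / one-way ANOVA model).
residuals = [y - mean(ys) for ys in y_by_group.values() for y in ys]

print(f"excess kurtosis of pooled Y:  {excess_kurtosis(pooled_y):.2f}")
print(f"excess kurtosis of residuals: {excess_kurtosis(residuals):.2f}")
```

The pooled Y looks badly non-normal even though the model’s assumption holds perfectly. The residuals, which are what the assumption is actually about, look fine.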

Conclusion:

The best rule in statistical analysis: always stop and think about your particular data analysis situation.

If you don’t understand or don’t have the experience to evaluate your situation, discuss it with someone who does. Investigate it. This is how you’ll learn.

 


A Reason to Not Drop Outliers

September 23rd, 2008 by

I recently had this question in consulting:

I’ve got 12 out of 645 cases with Mahalanobis’s Distances above the critical value, so I removed them and reran the analysis, only to find that another 10 cases were now outside the value. I removed these, and another 10 appeared, and so on until I have removed over 100 cases from my analysis! Surely this can’t be right!?! Do you know any way around this? It is really slowing down my analysis and I have no idea how to sort this out!!

And this was my response:

I wrote an article about dropping outliers.  As you’ll see, you can’t just drop outliers without a REALLY good reason.  Being influential is not in itself a good enough reason to drop data.
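To see why the removals in the question above never seem to end, here is a simplified sketch. It is not the questioner’s analysis: it uses a single made-up skewed variable, and the distance of each case from the mean in standard deviations (which is what Mahalanobis distance reduces to in one dimension) with an assumed cutoff of 3. Each time cases are dropped, the mean and standard deviation are recomputed, so the cutoff moves inward.

```python
import random
from statistics import mean, stdev


def drop_beyond_cutoff(values, cutoff=3.0):
    """One pass: keep only points within `cutoff` SDs of the current mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s <= cutoff]


random.seed(0)  # arbitrary seed for reproducibility
# Hypothetical skewed variable; 645 cases to match the question above.
data = [random.lognormvariate(0.0, 1.0) for _ in range(645)]

removed_per_round = []
while True:
    kept = drop_beyond_cutoff(data)
    if len(kept) == len(data):
        break  # nothing exceeded the cutoff this round
    removed_per_round.append(len(data) - len(kept))
    data = kept

print("cases removed in each round:", removed_per_round)
print("total removed:", sum(removed_per_round))
```

Removing the most extreme cases shrinks the spread, which makes the next most extreme cases look like outliers. That is a property of the procedure, not evidence that those cases are bad data.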

 


Outliers: To Drop or Not to Drop

September 17th, 2008 by

Should you drop outliers? Outliers are one of those statistical issues that everyone knows about, but most people aren’t sure how to deal with.  Most parametric statistics, like means, standard deviations, and correlations, and every statistic based on these, are highly sensitive to outliers.

And since the assumptions of common statistical procedures, like linear regression and ANOVA, are also based on these statistics, outliers can really mess up your analysis.

Despite all this, as much as you’d like to, it is NOT acceptable to drop an observation just because it is an outlier.  Outliers can be legitimate observations and are sometimes the most interesting ones.  It’s important to investigate the nature of the outlier before deciding.

  1. If it is obvious that the outlier is due to incorrectly entered or measured data, you should drop the outlier:

    For example, I once analyzed a data set in which a woman’s weight was recorded as 19 lbs.  I knew that was physically impossible.  Her true weight was probably 91, 119, or 190 lbs, but since I didn’t know which one, I dropped the outlier.

    This also applies to a situation in which you know the datum did not accurately measure what you intended.  For example, if you are testing people’s reaction times to an event, but you see that a participant is not paying attention and is randomly hitting the response key, you know it is not an accurate measurement.

  2. If the outlier does not change the results but does affect assumptions, you may drop the outlier.  But note that in a footnote of your paper.

    Neither the presence nor absence of the outlier in the graph below would change the regression line:

    graph-1

  3. More commonly, the outlier affects both results and assumptions.  In this situation, it is not legitimate to simply drop the outlier.  You may run the analysis both with and without it, but you should state in at least a footnote the dropping of any such data points and how the results changed.

    graph-2

  4. If the outlier creates a strong association, you should drop the outlier and should not report any association from your analysis.

    In the following graph, the relationship between X and Y is clearly created by the outlier.  Without it, there is no relationship between X and Y, so the regression coefficient does not truly describe the effect of X on Y.

    graph-3
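Situations 3 and 4 both come down to fitting the model with and without the point and comparing. Here is a minimal sketch of that comparison. The data are made up (six points with no real trend plus one far-away case), and the slope is computed by hand with ordinary least squares to keep the example self-contained.

```python
def ols_slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Made-up data: six points with no real trend, plus one far-away case.
bulk = [(1, 3), (2, 1), (3, 2), (4, 3), (5, 1), (6, 2)]
outlier = (20, 15)

slope_with = ols_slope(bulk + [outlier])
slope_without = ols_slope(bulk)
print(f"slope with the outlier:    {slope_with:.2f}")
print(f"slope without the outlier: {slope_without:.2f}")
```

When the two slopes tell different stories, as they do here, that difference is exactly what belongs in your footnote.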

So in those cases where you shouldn’t drop the outlier, what do you do?

One option is to try a transformation.  Square root and log transformations both pull in high numbers.  This can make assumptions work better if the outlier is on the dependent variable and can reduce the impact of a single point if the outlier is on an independent variable.
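As a quick illustration of “pulling in” high numbers, the sketch below measures how far the largest value sits from the mean, in standard deviations, before and after each transformation. The data (the integers 1 to 30 plus a single value of 500) and the helper name are made up, and this assumes all values are positive so the log is defined.

```python
from math import log, sqrt
from statistics import mean, stdev


def z_of_max(values):
    """How many SDs the largest value sits above the mean."""
    return (max(values) - mean(values)) / stdev(values)


# Made-up positive data: 1..30 plus one very large value.
raw = list(range(1, 31)) + [500]

print(f"raw:         z of max = {z_of_max(raw):.2f}")
print(f"square root: z of max = {z_of_max([sqrt(v) for v in raw]):.2f}")
print(f"log:         z of max = {z_of_max([log(v) for v in raw]):.2f}")
```

The log pulls the extreme value in more strongly than the square root. Whether either scale makes sense for your variable is a question about your research area, not about the numbers.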

Another option is to try a different model.  This should be done with caution, but it may be that a non-linear model fits better.  For example, in situation 3 above, perhaps an exponential curve fits the data with the outlier intact.

Whichever approach you take, you need to know your data and your research area well.  Try different approaches, and see which make theoretical sense.