Data cleaning is a critical part of any data analysis. Without properly prepared data, the analysis will yield inaccurate results, and correcting errors later in the analysis adds to the time, effort, and cost of the project.
Outliers and Their Origins
Outliers are one of those realities of data analysis that no one can avoid.
Those pesky extreme values cause biased parameter estimates, non-normality in otherwise beautifully normal variables, and inflated variances.
Everyone agrees that outliers cause trouble with parametric analyses. But not everyone agrees that they’re always a problem, or what to do about them even if they are.
Sometimes a nonparametric or robust alternative is available — and sometimes not.
There are a number of approaches in statistical analysis for dealing with outliers and the problems they create. It’s common for committee members or Reviewer #2 to have very strong opinions that there is one and only one good approach.
Two approaches that I’ve commonly seen are: 1) delete outliers from the sample, or 2) winsorize them (i.e., replace the outlier value with one that is less extreme).
The problem with both of these “solutions” is that they cause problems of their own: biased parameter estimates and underweighted or eliminated valid values.
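To see what the two approaches actually do to a sample, here is a minimal Python sketch. It is an illustration only: the 5th and 95th percentile cutoffs, the helper names (`percentile`, `trim`, `winsorize`), and the sample values are all assumptions, not a recommended rule.

```python
import statistics

def percentile(sorted_vals, p):
    """Linear-interpolation percentile (p in [0, 100]) of pre-sorted values."""
    k = (len(sorted_vals) - 1) * p / 100
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)

def trim(values, lower=5, upper=95):
    """Approach 1: delete values outside the percentile cutoffs (assumed cutoffs)."""
    s = sorted(values)
    lo, hi = percentile(s, lower), percentile(s, upper)
    return [v for v in values if lo <= v <= hi]

def winsorize(values, lower=5, upper=95):
    """Approach 2: replace values outside the cutoffs with the cutoff itself."""
    s = sorted(values)
    lo, hi = percentile(s, lower), percentile(s, upper)
    return [min(max(v, lo), hi) for v in values]

# Hypothetical sample: 19 ordinary values plus one extreme value.
sample = [48, 50, 51, 49, 52, 47, 53, 50, 49, 51,
          50, 48, 52, 51, 49, 50, 53, 47, 50, 120]

trimmed = trim(sample)
winsorized = winsorize(sample)

for name, data in [("original", sample), ("trimmed", trimmed), ("winsorized", winsorized)]:
    print(name, len(data), round(statistics.mean(data), 2), round(statistics.stdev(data), 2))
```

Trimming shrinks the sample, while winsorizing keeps every case but changes the extreme value. Either way the summary statistics move, which is fine if the extreme value was an error and a problem if it was a valid observation.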
Member Training: Working with Truncated and Censored Data
Statistically speaking, when we see a continuous outcome variable we often worry about outliers and how these extreme observations can impact our model.
But have you ever had an outcome variable with no outliers because there was a boundary value at which accurate measurements couldn’t be or weren’t recorded?
Examples include:
- Income data where all values above $100,000 are recorded as $100k or greater
- Soil toxicity ratings where the device cannot measure values below 1 ppm
- Number of arrests where there are no zeros because the data set came from police records where all participants had at least one arrest
These are all examples of data that are truncated or censored. Failing to account for the truncation or censoring will bias the results.
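A small simulation can make the bias visible. The Python sketch below is not taken from the webinar. It assumes a normal income distribution with made-up parameters and reuses the $100,000 boundary from the income example. Here "censored" means the case stays in the data but its value is recorded at the boundary, and "truncated" means cases beyond the boundary never enter the data set at all.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

CUTOFF = 100_000  # boundary from the income example

# Hypothetical "true" incomes; the distribution and its parameters are assumptions.
true_income = [random.gauss(80_000, 30_000) for _ in range(10_000)]

# Censored: every case is kept, but values above the cutoff are recorded as the cutoff.
censored = [min(x, CUTOFF) for x in true_income]

# Truncated: cases above the cutoff are missing from the data set entirely.
truncated = [x for x in true_income if x <= CUTOFF]

true_mean = statistics.mean(true_income)
censored_mean = statistics.mean(censored)
truncated_mean = statistics.mean(truncated)

print("cases kept:", len(true_income), len(censored), len(truncated))
print("naive means:", round(true_mean), round(censored_mean), round(truncated_mean))
```

In this setup both naive means fall below the mean of the full data, and the truncated sample is off by more than the censored one. A model that ignores the boundary inherits that distortion.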
This webinar will discuss what truncated and censored data are and how to identify them.
There are several different models that are used with this type of data. We will go over each model and discuss which type of data is appropriate for each model.
We will then compare the results of models that account for truncated or censored data to those that do not. From this you will see what possible impact the wrong model choice has on the results.
Note: This training is an exclusive benefit to members of the Statistically Speaking Membership Program and part of the Stat’s Amore Trainings Series. Each Stat’s Amore Training is approximately 90 minutes long.
Incorporating Graphs in Regression Diagnostics with Stata
by Jeff Meyer
You put a lot of work into preparing and cleaning your data. Running the model is the moment of excitement.
You look at your tables and interpret the results. But first you remember that one or more variables had a few outliers. Did these outliers impact your results?
Five things you need to know before learning Structural Equation Modeling
By Manolo Romero Escobar
If you already know the principles of general linear modeling (GLM) you are on the right path to understand Structural Equation Modeling (SEM).
As you could see from my previous post, SEM offers the flexibility of adding paths between predictors, something that would take you several separate GLMs and still leave you with unanswered questions.
It also helps you use latent variables (as you will see in future posts).
GLM is just one of the pieces of the puzzle to fit SEM to your data. You also need to have an understanding of:
[Read more…] about Five things you need to know before learning Structural Equation Modeling
Member Training: Outliers and Influential Points
Outliers. There are as many opinions on what to do about them as there are causes for them.
In this webinar, we’ll explore the different types of outliers, methods for figuring out which type you have, whether they’re influential, and what to do about them.
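One rough way to check whether an unusual point is influential is to refit the model without it and see how much the estimates move. The Python sketch below does this for the slope of a simple regression. It is an informal leave-one-out check, not a formal diagnostic and not necessarily the method covered in the webinar; the data and the function names (`ols`, `slope_shifts`) are hypothetical.

```python
def ols(xs, ys):
    """Return (intercept, slope) of a simple least-squares line."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def slope_shifts(xs, ys):
    """For each point, the change in slope when that point is left out."""
    _, full_slope = ols(xs, ys)
    shifts = []
    for i in range(len(xs)):
        _, s = ols(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        shifts.append(s - full_slope)
    return shifts

# Hypothetical data: eight points on the line y = 2x, plus two unusual points.
# Index 8 is extreme in y but sits near the middle of the x values.
# Index 9 is extreme in x and far from the line.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 4.5, 20]
ys = [2, 4, 6, 8, 10, 12, 14, 16, 30, 10]

shifts = slope_shifts(xs, ys)
most_influential = max(range(len(shifts)), key=lambda i: abs(shifts[i]))

for i, shift in enumerate(shifts):
    print(i, xs[i], ys[i], round(shift, 3))
print("largest slope change when dropping index", most_influential)
```

Both added points are outliers, but dropping the one that is extreme in x changes the slope far more than dropping the one that is extreme only in y. An outlier is not automatically an influential point.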
Note: This training is an exclusive benefit to members of the Statistically Speaking Membership Program and part of the Stat’s Amore Trainings Series. Each Stat’s Amore Training is approximately 90 minutes long.
About the Instructor
Karen Grace-Martin helps statistics practitioners gain an intuitive understanding of how statistics is applied to real data in research studies.
She has guided and trained researchers through their statistical analysis for over 15 years as a statistical consultant at Cornell University and through The Analysis Factor. She has master’s degrees in both applied statistics and social psychology and is an expert in SPSS and SAS.