The Analysis Factor

Statistical Consulting, Resources, and Statistics Workshops for Researchers

The Second Problem with Mean Imputation

by Karen Grace-Martin

A previous post discussed the first reason not to use mean imputation to deal with missing data: it does not preserve the relationships among variables.

A second reason is that any type of single imputation underestimates the error variation in any statistic computed from the imputed data. Because the imputations are themselves estimates, there is some error associated with them. But your statistical software doesn't know that; it treats the imputed values as real data.

Ultimately, your standard errors come out too low, and so do your p-values. Now you're making Type I errors without realizing it.

A better approach?  Multiple Imputation or Full Information Maximum Likelihood.
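To see the problem concretely, here is a minimal simulation sketch, not from the original post and assuming only Python's numpy, showing how mean imputation shrinks estimated variability:

```python
# A sketch of how mean imputation shrinks variance estimates.
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=50, scale=10, size=1000)   # the "complete" data

# Make 30% of the values missing completely at random.
missing = rng.random(x.size) < 0.30
observed = x[~missing]

# Mean imputation: every missing value becomes the observed mean.
imputed = x.copy()
imputed[missing] = observed.mean()

print(f"SD of complete data:   {x.std(ddof=1):.2f}")
print(f"SD of observed values: {observed.std(ddof=1):.2f}")
print(f"SD after imputation:   {imputed.std(ddof=1):.2f}")  # clearly smaller

# The naive standard error treats all n imputed values as real,
# independent observations, so it is understated as well.
print(f"Naive SE of the mean:  {imputed.std(ddof=1) / np.sqrt(imputed.size):.3f}")
```

With 30% of the values replaced by a constant, the standard deviation drops by roughly a factor of √0.7, and every standard error built on it shrinks along with it.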



Introduction to Logistic Regression

by Karen Grace-Martin

Researchers are often interested in setting up a model to analyze the relationship between some predictors (i.e., independent variables) and a response (i.e., dependent variable). Linear regression is commonly used when the response variable is continuous. One assumption of linear models is that the residual errors follow a normal distribution. This assumption fails when the response variable is categorical, so an ordinary linear model is not appropriate. This article presents a regression model for a response variable that is dichotomous, that is, having two categories. Examples are common: whether a plant lives or dies, whether a survey respondent agrees or disagrees with a statement, or whether an at-risk child graduates or drops out of high school.

In ordinary linear regression, the response variable (Y) is a linear function of the coefficients (B0, B1, etc.) that correspond to the predictor variables (X1, X2, etc.). A typical model would look like:

Y = B0 + B1*X1 + B2*X2 + B3*X3 + … + E

For a dichotomous response variable, we could set up a similar linear model to predict individuals’ category memberships if numerical values are used to represent the two categories. Arbitrary values of 1 and 0 are chosen for mathematical convenience. Using the first example, we would assign Y = 1 if a plant lives and Y = 0 if a plant dies.

This linear model does not work well for a few reasons. First, the response values, 0 and 1, are arbitrary, so modeling the actual values of Y is not really of interest. Second, it is the probability that each individual in the population responds with 0 or 1 that we are interested in modeling. For example, we may find that plants with a high level of a fungal infection (X1) fall into the category “the plant lives” (Y = 1) less often than plants with a low level of infection. Thus, as the level of infection rises, the probability of a plant living decreases.

Thus, we might consider modeling P, the probability, as the response variable. Again, there are problems. Probability generally decreases as infection level increases, but P, like all probabilities, can only fall within the boundaries of 0 and 1. Consequently, it is better to assume that the relationship between X1 and P is sigmoidal (S-shaped), rather than a straight line.

It is possible, however, to find a linear relationship between X1 and a function of P. Although a number of functions work, one of the most useful is the logit function. It is the natural log of the odds that Y is equal to 1, that is, the probability that Y is 1 divided by the probability that Y is 0. The relationship between the logit of P and P itself is sigmoidal in shape. The regression equation that results is:

ln[P/(1-P)] = B0 + B1*X1 + B2*X2 + …

Although the left side of this equation looks intimidating, this way of expressing the probability results in the right side of the equation being linear and looking familiar to us. This helps us understand the meaning of the regression coefficients. The coefficients can easily be transformed so that their interpretation makes sense.
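Solving the logit equation for P recovers the S-shaped curve described above:

P = 1 / (1 + e^-(B0 + B1*X1 + B2*X2 + …))

As a concrete illustration, here is a minimal sketch using simulated plant-survival data and Python's statsmodels library; the variable names and true coefficients are made up for the example. Exponentiating a coefficient turns it into an odds ratio:

```python
# A sketch: logistic regression on simulated plant-survival data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
infection = rng.uniform(0, 10, size=n)            # X1: fungal infection level

# Simulate survival with a true model of logit(P) = 2 - 0.5*X1.
p_live = 1 / (1 + np.exp(-(2 - 0.5 * infection)))
lives = rng.binomial(1, p_live)                   # Y: 1 = lives, 0 = dies

X = sm.add_constant(infection)                    # adds the intercept column
result = sm.Logit(lives, X).fit(disp=False)

print(result.params)          # B0 and B1 on the log-odds scale
print(np.exp(result.params))  # exp(B1): the odds ratio per unit of infection
```

Here exp(B1) is the multiplicative change in the odds that the plant lives for each one-unit increase in infection level, which is the transformation that makes the coefficients easy to interpret.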

The logistic regression equation can be extended beyond the case of a dichotomous response variable to the cases of ordered categories and polytomous categories (more than two categories).



Tagged With: binary variable, dichotomous response, log-odds, logistic regression, ordered categories, polytomous categories, predictors, sigmoidal relationship


A Reason to Not Drop Outliers

by Karen Grace-Martin

I recently had this question in consulting:

I’ve got 12 out of 645 cases with Mahalanobis’s Distances above the critical value, so I removed them and reran the analysis, only to find that another 10 cases were now outside the value. I removed these, and another 10 appeared, and so on until I have removed over 100 cases from my analysis! Surely this can’t be right!?! Do you know any way around this? It is really slowing down my analysis and I have no idea how to sort this out!!

And this was my response:

I wrote an article about dropping outliers.  As you’ll see, you can’t just drop outliers without a REALLY good reason.  Being influential is not in itself a good enough reason to drop data.
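To see why the distances cascade, here is a minimal simulation sketch (my own made-up data, not the questioner's) assuming numpy and scipy. With heavy-tailed but perfectly legitimate data, each trim tightens the covariance estimate, so a fresh batch of cases crosses the cutoff on the next pass:

```python
# A sketch of the outlier "cascade" with heavy-tailed (but legitimate) data.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
X = rng.standard_t(df=3, size=(645, 2))   # heavy tails, no bad data points

def mahalanobis_sq(data):
    """Squared Mahalanobis distance of each row from the sample centroid."""
    centered = data - data.mean(axis=0)
    VI = np.linalg.inv(np.cov(data, rowvar=False))
    return np.einsum('ij,jk,ik->i', centered, VI, centered)

critical = chi2.ppf(0.999, df=X.shape[1])  # the usual p < .001 cutoff

for step in range(1, 6):
    flagged = mahalanobis_sq(X) > critical
    print(f"pass {step}: {flagged.sum()} cases above the critical value")
    if not flagged.any():
        break
    X = X[~flagged]   # trimming tightens the covariance -> new "outliers"
```

When the trimming never settles, that is usually a sign the data are heavy-tailed, not that a hundred cases are bad.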

Tagged With: dropping outliers, influential outliers, Mahalanobis distance, outliers


Outliers: To Drop or Not to Drop

by Karen Grace-Martin

Should you drop outliers? Outliers are one of those statistical issues that everyone knows about, but most people aren’t sure how to deal with.  Most parametric statistics, like means, standard deviations, and correlations, and every statistic based on these, are highly sensitive to outliers.

And since the assumptions of common statistical procedures, like linear regression and ANOVA, are also based on these statistics, outliers can really mess up your analysis.

Despite all this, as much as you’d like to, it is NOT acceptable to drop an observation just because it is an outlier.  They can be legitimate observations and are sometimes the most interesting ones.  It’s important to investigate the nature of the outlier before deciding.

  1. If it is obvious that the outlier is due to incorrectly entered or measured data, you should drop the outlier:

    For example, I once analyzed a data set in which a woman’s weight was recorded as 19 lbs.  I knew that was physically impossible.  Her true weight was probably 91, 119, or 190 lbs, but since I didn’t know which one, I dropped the outlier.

    This also applies to a situation in which you know the datum did not accurately measure what you intended.  For example, if you are testing people’s reaction times to an event, but you saw that the participant is not paying attention and randomly hitting the response key, you know it is not an accurate measurement.

  2. If the outlier does not change the results but does affect assumptions, you may drop the outlier, but note the omission in a footnote of your paper.

    Neither the presence nor absence of the outlier in the graph below would change the regression line:

    [graph-1: the regression line is the same with and without the outlier]

  3. More commonly, the outlier affects both results and assumptions.  In this situation, it is not legitimate to simply drop the outlier.  You may run the analysis both with and without it, but you should state, at least in a footnote, that you dropped the outlier and how the results changed.

    [graph-2: the outlier changes both the regression line and the assumptions]

  4. If the outlier creates a strong association, you should drop the outlier and should not report any association from your analysis.

    In the following graph, the relationship between X and Y is clearly created by the outlier.  Without it, there is no relationship between X and Y, so the regression coefficient does not truly describe the effect of X on Y.

    [graph-3: the association between X and Y is created entirely by the outlier]

So in those cases where you shouldn’t drop the outlier, what do you do?

One option is to try a transformation.  Square root and log transformations both pull in high numbers.  This can make assumptions work better if the outlier is in a dependent variable, and can reduce the leverage of a single point if the outlier is in an independent variable.
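As a quick illustration with made-up numbers, assuming only Python's numpy, both transformations shrink the gap between an extreme point and the rest of the data:

```python
# A sketch of how transformations pull in a high outlier.
import numpy as np

y = np.array([3.0, 5.0, 6.0, 8.0, 9.0, 11.0, 95.0])   # one extreme value

for name, t in [("raw", y), ("sqrt", np.sqrt(y)), ("log", np.log(y))]:
    rest = t[:-1]
    gap = (t[-1] - rest.mean()) / rest.std(ddof=1)
    print(f"{name:>4}: outlier sits {gap:.1f} SDs above the rest")
```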

Another option is to try a different model.  This should be done with caution, but it may be that a non-linear model fits better.  For example, in example 3, perhaps an exponential curve fits the data with the outlier intact.

Whichever approach you take, you need to know your data and your research area well.  Try different approaches, and see which make theoretical sense.

Tagged With: dropping outliers, outliers, regression assumptions, transformation


Multiple Imputation Resources

by Karen Grace-Martin

Two excellent resources about multiple imputation and missing data:

Joe Schafer’s Multiple Imputation FAQ Page gives more detail about multiple imputation, including a list of references.

Paul Allison’s 2001 book Missing Data is the most readable book on the topic. It gives in-depth information on many good approaches to missing data, including multiple imputation. It is aimed at social science researchers, and best of all, it is very affordable (about $15).

Tagged With: Missing Data, Multiple Imputation


The Statistics Myth: Why Statistics Seems so Hard to Learn

by Karen Grace-Martin

There are probably many myths about statistics, but there is one that I believe leads to the most frustration in researchers (and students) as they attempt to learn and apply statistics.

The Carpentry Class: A Fable

There was once a man who needed to build a house. He had a big pile of lumber, and he needed a place to live, so building one himself seemed like a good idea.

He realized that he did not have the knowledge or the many skills needed to build a house.

So he did what any intelligent, well-educated person would do. He took a course: House Building 101.

There was a lot of new jargon: trusses, plumb walls, 16” on center, cripple studs. It was hard to keep it all straight. It didn’t make sense. Why would anyone ever need a header anyway?

But he made it through with a B+. He learned the basics. The doghouse he built in the lab was pretty straight. He even took another course to make sure he knew enough: Advanced Carpentry.

It was time for the man to build his house. He had his land, his plan, his tools, his sacks of concrete, windows, lumber, and nails.

The first day he started with enthusiasm. He swung his hammer with gusto and nailed his first wall into place. It felt good.

But wait. His house was being built on a hill. The textbook only had flat land. How should he deal with hills?

And this house has a bay window. His doghouse had only double hung windows. Doesn’t a bay window stick out?

And he was not sure which technique to use to make that 145 degree angle in the hall. The courses never mentioned anything but 90 degree angles.

In class, they used circular saws. In order to install the trim he ordered, he needed to use a chop saw and a table saw.

He didn’t realize he was supposed to put in the plumbing before the electric, so he ended up doing a LOT of rewiring when the plumbing wouldn’t fit around his wires.

Even with the plans in front of him, there were so many decisions to make, so many new skills to learn.

And he was supposed to move into the house in 4 months when his lease ran out. He’d never get it done in time. Not on his own.

He sounds like a fool, doesn't he? No one could build a house after taking only a few courses. Especially not with a deadline.

Building a house requires the knowledge of how walls are constructed, sure. But it also requires the ability to use the tools, and the practical skills to implement the techniques.

We can see that this project was a silly one to tackle, yet all the time we think it’s our fault that we have trouble with statistical analysis after taking a few classes.

The Statistics Myth:

Having knowledge about statistics is the only thing necessary to practice statistics.

This isn’t true.

And it’s not helpful.

Yes, the knowledge is necessary, but it is not sufficient.

Statistics doesn’t make sense to students because it is taught out of context. Most people don’t really learn statistics until they start analyzing data in their own research. Yes, it makes those classes tough. You need to acquire the knowledge before you can truly understand it.

The only way to learn how to build a house is to build one. The only way to learn how to analyze data is to analyze some.

Here’s the thing. Data analysts (and house builders) need practical support as they learn. Yes, both could slug it out on their own, but it takes longer, is more frustrating, and leads to many more mistakes.

None of this is necessary. There can be a happy ending.

Carpenters work alongside a master to learn their craft. I have never heard of a statistician or a thesis advisor who sits next to a novice analyzing data. (Anyone who had an advisor like that should consider themselves lucky). Unlike a novice carpenter, a novice data analyst is not helpful. They can’t even hold the ladder.

More common are advisors who tell their students which statistics classes to take (again, if they're lucky), then send them off to analyze data. The student can ask questions as they go along, if they are not too afraid to admit what they don't know. And if their advisor is available. And knows the answer.

Really good advisors are not too busy to answer in a timely manner and are willing to admit it if they don’t know the answer.

But most data analysts feel a bit lost. Not just new ones: many experienced researchers never really learned statistical practice very well in the first place. Nearly all researchers face new statistical challenges as their research progresses, and it's often difficult to find someone knowledgeable enough who is willing and able to explain them.

They are not lost because they are stupid.

They are not lost because statistics is beyond their capabilities.

They are not lost because they didn’t do well in their statistics classes.

They are lost because like carpentry, statistical analysis is an applied skill, a craft.

Acquiring the background knowledge is only one essential part of mastering a craft.

Others include:

  • a belief you can do it
  • a commitment to best practices
  • experience in applying the skills in different situations
  • proficiency in using the tools
  • a resource library
  • ongoing training to learn new skills
  • (ideally) a mentor to guide you as you practice.

Think about it.  How many skills (dancing, sailing, teaching) have you acquired in your life by only taking a class that gave you background knowledge, but no real experience and no real mentor to coach you?

So if you’re stuck on something in statistics, give yourself a break.  You can do this with the right support.

Everything we do at The Analysis Factor is to help you get unstuck.  If you’re frustrated, tired, or even scared…there is another way.

 

If you need help right now, we’ve got your back. Please check out our Statistical Consulting services and our Statistically Speaking membership.

Tagged With: learning statistics, statistics

