Karen Grace-Martin

The Second Problem with Mean Imputation

October 2nd, 2008 by Karen Grace-Martin

A previous post discussed the first reason not to use mean imputation as a way of dealing with missing data: it does not preserve the relationships among variables.

A second reason is that any type of single imputation underestimates the error variance of any statistic that uses the imputed data.  Because the imputations are themselves estimates, there is some error associated with them.  But your statistical software doesn’t know that.  It treats the imputed values as real data.

Because your standard errors are too low, so are your p-values.  Now you’re making Type I errors without realizing it.

A better approach?  Multiple Imputation or Full Information Maximum Likelihood.
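A quick simulation makes the problem concrete.  This is just a sketch using NumPy; the sample size, the 30% missingness rate, and the distribution are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a complete variable, then delete 30% of it at random.
x = rng.normal(loc=50, scale=10, size=1000)
missing = rng.random(1000) < 0.30
observed = x[~missing]

# Mean imputation: every missing value becomes the observed mean.
imputed = x.copy()
imputed[missing] = observed.mean()

# The filled-in constant contributes no spread of its own, so the
# standard deviation -- and every standard error built on it -- shrinks.
print(f"SD of observed values:    {observed.std(ddof=1):.2f}")
print(f"SD after mean imputation: {imputed.std(ddof=1):.2f}")
```

Since the standard error of the mean is the standard deviation divided by the square root of n, the understated spread flows straight through to understated standard errors and overstated significance.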


Introduction to Logistic Regression

September 26th, 2008 by Karen Grace-Martin

Researchers are often interested in setting up a model to analyze the relationship between some predictors (i.e., independent variables) and a response (i.e., dependent variable). Linear regression is commonly used when the response variable is continuous.  One assumption of linear models is that the residual errors follow a normal distribution. This assumption fails when the response variable is categorical, so an ordinary linear model is not appropriate. This article presents a regression model for a response variable that is dichotomous–having two categories. Examples are common: whether a plant lives or dies, whether a survey respondent agrees or disagrees with a statement, or whether an at-risk child graduates from or drops out of high school.

In ordinary linear regression, the response variable (Y) is a linear function of the coefficients (B0, B1, etc.) that correspond to the predictor variables (X1, X2, etc.). A typical model would look like:

Y = B0 + B1*X1 + B2*X2 + B3*X3 + … + E

For a dichotomous response variable, we could set up a similar linear model to predict individuals’ category memberships if numerical values are used to represent the two categories. Arbitrary values of 1 and 0 are chosen for mathematical convenience. Using the first example, we would assign Y = 1 if a plant lives and Y = 0 if a plant dies.

This linear model does not work well for a few reasons. First, the response values, 0 and 1, are arbitrary, so modeling the actual values of Y is not of direct interest. Second, it is really the probability that each individual in the population responds with 0 or 1 that we are interested in modeling. For example, we may find that plants with a high level of a fungal infection (X1) fall into the category “the plant lives” (Y) less often than plants with a low level of infection. Thus, as the level of infection rises, the probability of a plant living decreases.

Thus, we might consider modeling P, the probability, as the response variable. Again, there are problems. Although P generally decreases as the infection level increases, we know that P, like all probabilities, can only fall within the boundaries of 0 and 1. Consequently, it is better to assume that the relationship between X1 and P is sigmoidal (S-shaped), rather than a straight line.

It is possible, however, to find a linear relationship between X1 and a function of P. Although a number of functions work, one of the most useful is the logit function. It is the natural log of the odds that Y equals 1: the probability that Y is 1 divided by the probability that Y is 0. The relationship between P and the predictors is sigmoidal in shape, while the relationship between the logit of P and the predictors is linear. The regression equation that results is:

ln[P/(1-P)] = B0 + B1*X1 + B2*X2 + …

Although the left side of this equation looks intimidating, expressing the probability this way makes the right side of the equation linear and familiar to us. This helps us understand the meaning of the regression coefficients. The coefficients can easily be transformed (exponentiating a coefficient turns it into an odds ratio) so that their interpretation makes sense.
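As a small sketch in plain NumPy, solving the equation above for P recovers the sigmoid curve, and exponentiating a coefficient gives its odds ratio.  The coefficient values B0 = -2 and B1 = 0.8 here are made up for illustration:

```python
import numpy as np

def logit(p):
    """Natural log of the odds that Y = 1."""
    return np.log(p / (1 - p))

def inv_logit(z):
    """Solve ln[P/(1-P)] = z for P; this is the sigmoid curve."""
    return 1 / (1 + np.exp(-z))

# Hypothetical coefficients for a single predictor X1.
b0, b1 = -2.0, 0.8
x1 = np.linspace(-5, 5, 11)

# Predicted probabilities stay strictly between 0 and 1,
# tracing an S-shaped curve as X1 increases.
p = inv_logit(b0 + b1 * x1)
print(np.round(p, 3))

# exp(B1) is the odds ratio: the odds that Y = 1 are multiplied
# by about 2.23 for each one-unit increase in X1.
print(np.exp(b1))
```

The same two functions work in both directions: `logit` maps a probability onto the linear scale of the regression equation, and `inv_logit` maps a fitted linear predictor back onto the probability scale.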

The logistic regression equation can be extended beyond the case of a dichotomous response variable to the cases of ordered categories and polytomous categories (more than two categories).

 


A Reason to Not Drop Outliers

September 23rd, 2008 by Karen Grace-Martin

I recently had this question in consulting:

I’ve got 12 out of 645 cases with Mahalanobis’s Distances above the critical value, so I removed them and reran the analysis, only to find that another 10 cases were now outside the value. I removed these, and another 10 appeared, and so on until I have removed over 100 cases from my analysis! Surely this can’t be right!?! Do you know any way around this? It is really slowing down my analysis and I have no idea how to sort this out!!

And this was my response:

I wrote an article about dropping outliers.  As you’ll see, you can’t just drop outliers without a REALLY good reason.  Being influential is not in itself a good enough reason to drop data.
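The endless stream of new outliers is not a fluke.  Mahalanobis distances are computed relative to the sample mean and covariance, so once the flagged cases are removed and both are re-estimated, the data cloud tightens and new cases cross the same cutoff.  Here is a sketch of that cycle using NumPy and SciPy, with simulated multivariate normal data (so there are no "true" outliers at all) and an arbitrary 97.5% chi-square cutoff:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(42)
p = 5                                            # number of variables
data = rng.multivariate_normal(np.zeros(p), np.eye(p), size=645)
cutoff = chi2.ppf(0.975, df=p)                   # critical value for squared distances

def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each row from the sample centroid."""
    centered = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    return np.einsum("ij,jk,ik->i", centered, inv_cov, centered)

# Iteratively drop everything past the cutoff, as in the question above.
# The mean and covariance are re-estimated each round, so fresh cases
# keep crossing the (re-scaled) threshold.
X = data
for step in range(5):
    flags = mahalanobis_sq(X) > cutoff
    print(f"round {step + 1}: {flags.sum()} cases flagged out of {len(X)}")
    if not flags.any():
        break
    X = X[~flags]
```

Since every round flags the most extreme cases of whatever data remain, the procedure can keep trimming indefinitely even when the data are perfectly well behaved, which is exactly what the questioner observed.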

 


Outliers: To Drop or Not to Drop

September 17th, 2008 by Karen Grace-Martin

Should you drop outliers? Outliers are one of those statistical issues that everyone knows about, but most people aren’t sure how to deal with.  Most parametric statistics, like means, standard deviations, and correlations, and every statistic based on these, are highly sensitive to outliers.

And since the assumptions of common statistical procedures, like linear regression and ANOVA, are also based on these statistics, outliers can really mess up your analysis.

Despite all this, as much as you’d like to, it is NOT acceptable to drop an observation just because it is an outlier.  Outliers can be legitimate observations and are sometimes the most interesting ones.  It’s important to investigate the nature of the outlier before deciding.

  1. If it is obvious that the outlier is due to incorrectly entered or measured data, you should drop the outlier:

    For example, I once analyzed a data set in which a woman’s weight was recorded as 19 lbs.  I knew that was physically impossible.  Her true weight was probably 91, 119, or 190 lbs, but since I didn’t know which one, I dropped the outlier.

    This also applies to a situation in which you know the datum did not accurately measure what you intended.  For example, if you are testing people’s reaction times to an event, but you saw that a participant was not paying attention and was randomly hitting the response key, you know the measurement is not accurate.

  2. If the outlier does not change the results but does affect assumptions, you may drop the outlier.  But note the dropped observation in a footnote of your paper.

    Neither the presence nor absence of the outlier in the graph below would change the regression line:

    [Graph 1: the regression line with and without the outlier]

  3. More commonly, the outlier affects both results and assumptions.  In this situation, it is not legitimate to simply drop the outlier.  You may run the analysis both with and without it, but you should state, at least in a footnote, which data points you dropped and how the results changed.

    [Graph 2: an outlier that affects both results and assumptions]

  4. If the outlier creates a strong association, you should drop the outlier and should not report any association from your analysis.

    In the following graph, the relationship between X and Y is clearly created by the outlier.  Without it, there is no relationship between X and Y, so the regression coefficient does not truly describe the effect of X on Y.

    [Graph 3: an association created entirely by the outlier]

So in those cases where you shouldn’t drop the outlier, what do you do?

One option is to try a transformation.  Square root and log transformations both pull in high numbers.  This can make assumptions work better if the outlier is a dependent variable and can reduce the impact of a single point if the outlier is an independent variable.
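As a small illustration in NumPy, with a made-up sample containing one extreme value (note all values must be positive to take logs), both transformations pull the outlier in toward the rest of the data, the log more aggressively than the square root:

```python
import numpy as np

# Nine ordinary values plus one extreme outlier.
y = np.array([2, 3, 3, 4, 5, 5, 6, 7, 8, 1000], dtype=float)

# How far the maximum sits above the middle of the data,
# before and after transforming.
for name, v in [("raw", y), ("sqrt", np.sqrt(y)), ("log", np.log(y))]:
    print(f"{name:>4}: max / median = {v.max() / np.median(v):.1f}")
```

On the raw scale the maximum is 200 times the median; the square root and log scales shrink that ratio dramatically, which is why these transformations can rescue assumptions or tame an influential predictor value.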

Another option is to try a different model.  This should be done with caution, but it may be that a non-linear model fits better.  For example, in example 3, perhaps an exponential curve fits the data with the outlier intact.

Whichever approach you take, you need to know your data and your research area well.  Try different approaches, and see which make theoretical sense.

 


Multiple Imputation Resources

September 15th, 2008 by Karen Grace-Martin

Two excellent resources about multiple imputation and missing data:

Joe Schafer’s Multiple Imputation FAQ Page gives more detail about multiple imputation, including a list of references.

Paul Allison’s 2001 book Missing Data is the most readable book on the topic. It gives in-depth information on many good approaches to missing data, including multiple imputation. It is aimed at social science researchers, and best of all, it is very affordable (about $15).

 


The Statistics Myth: Why Statistics Seems so Hard to Learn

August 31st, 2008 by Karen Grace-Martin

There are probably many myths about statistics, but there is one that I believe leads to the most frustration in researchers (and students) as they attempt to learn and apply statistics.

The Carpentry Class: A Fable

There was once a man who needed to build a house. He had a big pile of lumber, and he needed a place to live, so building one himself seemed like a good idea.

He realized that he did not have the knowledge and many skills needed to build a house.

So he did what any intelligent, well-educated person would do. He took a course: House Building 101.

There was a lot of new jargon: trusses, plumb walls, 16” on center, cripple studs. It was hard to keep it all straight. It didn’t make sense. Why would anyone ever need a header anyway?

But he made it through with a B+. He learned the basics. The doghouse he built in the lab was pretty straight. He even took another course to make sure he knew enough: Advanced Carpentry.

It was time for the man to build his house. He had his land, his plan, his tools, his sacks of concrete, windows, lumber, and nails.

The first day he started with enthusiasm. He swung his hammer with gusto and nailed his first wall into place. It felt good.

But wait. His house was being built on a hill. The textbook only had flat land. How should he deal with hills?

And this house has a bay window. His doghouse had only double hung windows. Doesn’t a bay window stick out?

And he was not sure which technique to use to make that 145 degree angle in the hall. The courses never mentioned anything but 90 degree angles.

In class, they used circular saws. In order to install the trim he ordered, he needed to use a chop saw and a table saw.

He didn’t realize he was supposed to put in the plumbing before the electric, so he ended up doing a LOT of rewiring when the plumbing wouldn’t fit around his wires.

Even with the plans in front of him, there were so many decisions to make, so many new skills to learn.

And he was supposed to move into the house in 4 months when his lease ran out. He’d never get it done in time. Not on his own.

He sounds like a fool, doesn’t he? No one could build a house after taking even a few courses. Especially not with a deadline.

Building a house requires the knowledge of how walls are constructed, sure. But it also requires the ability to use the tools, and the practical skills to implement the techniques.

We can see that this project was a silly one to tackle, yet all the time we think it’s our fault that we have trouble with statistical analysis after taking a few classes.

The Statistics Myth:

Having knowledge about statistics is the only thing necessary to practice statistics.

This isn’t true.

And it’s not helpful.

Yes, the knowledge is necessary, but it is not sufficient.

Statistics doesn’t make sense to students because it is taught out of context. Most people don’t really learn statistics until they start analyzing data in their own research. Yes, it makes those classes tough. You need to acquire the knowledge before you can truly understand it.

The only way to learn how to build a house is to build one. The only way to learn how to analyze data is to analyze some.

Here’s the thing. Data analysts (and house builders) need practical support as they learn. Yes, both could slug it out on their own, but it takes longer, is more frustrating, and leads to many more mistakes.

None of this is necessary. There can be a happy ending.

Carpenters work alongside a master to learn their craft. I have never heard of a statistician or a thesis advisor who sits next to a novice analyzing data. (Anyone who had an advisor like that should consider themselves lucky.) Unlike a novice carpenter, a novice data analyst is of no help to a master. They can’t even hold the ladder.

More common are advisors who tell their students which statistics classes to take (again, if they’re lucky) then send them off to analyze data. The student can ask questions as they go along if they are not too afraid to admit what they don’t know.  And if their advisor is available. And knows the answer.

Really good advisors are not too busy to answer in a timely manner and are willing to admit it if they don’t know the answer.

But most data analysts feel a bit lost. Not just new ones—many experienced researchers never really learned statistical practice very well in the first place. Nearly all researchers face new statistical challenges as their research progresses, and it’s often difficult to find someone knowledgeable enough who is willing and able to explain it.

They are not lost because they are stupid.

They are not lost because statistics is beyond their capabilities.

They are not lost because they didn’t do well in their statistics classes.

They are lost because like carpentry, statistical analysis is an applied skill, a craft.

Acquiring the background knowledge is only one essential part of mastering a craft.

The others are real experience and a mentor to coach you.

Think about it.  How many skills (dancing, sailing, teaching) have you acquired in your life by only taking a class that gave you background knowledge, but no real experience and no real mentor to coach you?

So if you’re stuck on something in statistics, give yourself a break.  You can do this with the right support.

Everything we do at The Analysis Factor is to help you get unstuck.  If you’re frustrated, tired, or even scared…there is another way.

 

If you need help right now, we’ve got your back. Please check out our Statistical Consulting services and our Statistically Speaking membership.