Karen Grace-Martin

Problems Caused by Categorizing Continuous Variables

February 20th, 2009 by Karen Grace-Martin

I just came across this great article by Frank Harrell: Problems Caused by Categorizing Continuous Variables

It’s from the Vanderbilt University biostatistics department, so the examples are all medical, but the points hold for any field.

It goes right along with my recent post, Continuous and Categorical Variables: The Trouble with Median Splits.


Continuous and Categorical Variables: The Trouble with Median Splits

February 16th, 2009 by Karen Grace-Martin

A Median Split is one method for turning a continuous variable into a categorical one. Essentially, the idea is to find the median of the continuous variable. Every value below the median is put in the category “Low” and every value above it is labeled “High.”
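To make the procedure concrete, here is a minimal sketch in Python. The function name, the sample scores, and the rule for values that fall exactly on the median are all my own assumptions for illustration; the definition above does not say where ties should go.

```python
from statistics import median

def median_split(values):
    """Label each value "Low" or "High" relative to the sample median."""
    cut = median(values)
    # Assumption: a value exactly equal to the median is labeled "High".
    return ["Low" if v < cut else "High" for v in values]

# Hypothetical scores, invented for this example.
scores = [12, 15, 9, 22, 30, 18, 25, 11]
labels = median_split(scores)

for score, label in zip(scores, labels):
    print(score, label)
```

Notice that 18 and 30 end up with the same label, while 15 and 18 end up with different ones.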

This is a very common practice in many social science fields in which researchers are trained in ANOVA but not Regression. At least that was true when I was in grad school in psychology. And yes, oh so many years ago, I used all these techniques I’m going to tell you not to use.

There are problems with median splits. The first is purely logical. When a continuum is categorized, every value above the median, for example, is considered equal. Does it really make sense that a value just above the median is considered the same as values way out at the end, and different from values just below the median? Not so much.

So one solution is to split the sample into three groups, not two, then drop the middle group. This at least creates some separation between the two groups. The obvious problem here, though, is that you’re losing a third of your sample.

The second problem with categorizing a continuous predictor, regardless of how you do it, is loss of power (Aiken & West, 1991). It’s simply harder to find effects that are really there.

So why is it common practice? Because categorizing continuous variables is the only way to stuff them into an ANOVA, which is the only statistics method researchers in many fields are trained to do.

Rather than force a method that isn’t quite appropriate, it would behoove researchers, and the quality of their research, to learn the general linear model and how ANOVA fits into it. It’s really only a short leap from ANOVA to regression but a necessary one. GLMs can include interactions among continuous and categorical predictors just as ANOVA does.

If the predictor is left continuous, the GLM fits a regression line to its effect. If it is categorized, the model compares the group means instead. It often happens that while the difference in means isn’t significant, the slope is.
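A small simulation can illustrate the information that a median split throws away. This is only a sketch under assumptions I made up: a normally distributed predictor, a true slope of 0.3, normal noise, and a fixed random seed. It compares how strongly the outcome correlates with the continuous predictor versus the same predictor after a median split.

```python
import random
from statistics import mean, median

def pearson_r(xs, ys):
    """Pearson correlation computed from sums of squares and cross-products."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000                # hypothetical sample size

# Simulated data: y depends linearly on x plus noise (all values assumed).
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.3 * xi + rng.gauss(0, 1) for xi in x]

# Median split: 1.0 for "High", 0.0 for "Low".
cut = median(x)
x_split = [1.0 if xi > cut else 0.0 for xi in x]

r_continuous = pearson_r(x, y)
r_split = pearson_r(x_split, y)

print("r with continuous predictor:", round(r_continuous, 3))
print("r with median-split predictor:", round(r_split, 3))
```

The split version of the predictor should show the weaker relationship with the outcome, which is the loss of power described above: the same effect is there, but it is harder to detect.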

Reference: Aiken & West (1991). Multiple Regression: Testing and interpreting interactions.


Respect Your Data

February 13th, 2009 by Karen Grace-Martin

The steps you take to analyze data are just as important as the statistics you use. Mistakes and frustration in statistical analysis come as much, if not more, from poor process than from using the wrong statistical method.

Benjamin Earnhart of the University of Iowa has written a short (and humorous) article entitled “Respect Your Data” (requires LinkedIn account) that describes 23 practical steps that data analysts must take. This article was published in the newsletter of the American Statistical Association and has since been expanded and annotated.


Statistical Consulting 101: 4 Questions you Need to Answer to Choose a Statistical Method

February 11th, 2009 by Karen Grace-Martin

One of the most common situations in which researchers get stuck with statistics is choosing which statistical methodology is appropriate to analyze their data. If you start by asking the following four questions, you will be able to narrow things down considerably.

Even if you don’t know the implications of your answers, answering these questions will clarify issues for you. It will help you decide what information to seek, and it will make any conversations you have with statistical advisors more efficient and useful.

1. What is your research question? (more…)


When NOT to Center a Predictor Variable in Regression

February 9th, 2009 by Karen Grace-Martin

There are two reasons to center predictor variables in any type of regression analysis–linear, logistic, multilevel, etc.

1. To lessen the correlation between a multiplicative term (interaction or polynomial term) and its component variables (the ones that were multiplied).

2. To make interpretation of parameter estimates easier.
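The first reason is easy to see with a toy example. In this sketch the predictor values 1 through 10 are invented, and the squared term stands in for any multiplicative term. The code compares the correlation between a predictor and its square before and after centering at the mean.

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation computed from sums of squares and cross-products."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical predictor values, chosen only for illustration.
x = [float(v) for v in range(1, 11)]

# Raw predictor and its polynomial (squared) term.
r_raw = pearson_r(x, [v ** 2 for v in x])

# Center at the mean, then square the centered values.
x_mean = mean(x)
x_centered = [v - x_mean for v in x]
r_centered = pearson_r(x_centered, [v ** 2 for v in x_centered])

print("correlation of x with x squared, raw:", round(r_raw, 3))
print("correlation of x with x squared, centered:", round(r_centered, 3))
```

Because these made-up values are symmetric around their mean, centering removes the correlation entirely. With skewed real data the correlation usually drops but does not vanish.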

I was recently asked: when is centering NOT a good idea? (more…)


Order affects Regression Parameter Estimates in SPSS GLM

February 6th, 2009 by Karen Grace-Martin

I just discovered something in SPSS GLM that I never knew.

When you have an interaction in the model, the order you put terms into the Model statement affects which parameters SPSS gives you.

The default in SPSS is to automatically create interaction terms among all the categorical predictors. But if you want fewer than all those interactions, or if you want to put in an interaction involving a continuous variable, you need to choose Model–>Custom Model.

In the specific example of an interaction between a categorical and continuous variable, to interpret this interaction you need to output Regression Coefficients. Do this by choosing Options–>Regression Parameter Estimates.

If you put the main effects into the model first, followed by interactions, you will find the usual output–the regression coefficient (column B) for the continuous variable is the slope for the reference group. The coefficients for the interactions in the other categories tell you the difference between the slope for that category and the slope for the reference group. The coefficient for the reference group here in the interaction is 0.

What I was surprised to find is that if the interactions are put into the model first, you don’t get that.

Instead, the coefficient for the interaction in each category is the actual slope for that group, NOT the difference.

This is actually quite useful–it can save a bit of calculating and now you have a p-value for whether each slope is different from 0. However, it also means you have to be cautious and make sure you realize what each parameter estimate is actually estimating.
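The relationship between the two sets of estimates is simple arithmetic, and a sketch in Python may help. This is not SPSS output. It assumes a model with a separate intercept and slope for each group, in which case the fitted slopes match what you get from fitting a line to each group on its own. The two groups and their data are invented.

```python
from statistics import mean

def ols_slope(xs, ys):
    """Least-squares slope of y on x for a single group."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data for two categories of the categorical predictor.
data = {
    "reference": ([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.0]),
    "treatment": ([1, 2, 3, 4, 5], [1.0, 4.1, 6.9, 10.2, 13.0]),
}
slopes = {group: ols_slope(xs, ys) for group, (xs, ys) in data.items()}

# Main effects entered first: the continuous variable's coefficient is the
# reference-group slope, and the interaction coefficient is a difference.
coef_continuous = slopes["reference"]
coef_interaction = slopes["treatment"] - slopes["reference"]

# Interactions entered first: each interaction coefficient is that group's
# own slope, so nothing needs to be added up.
print("main effects first:", round(coef_continuous, 3), round(coef_interaction, 3))
print("interactions first:", {g: round(s, 3) for g, s in slopes.items()})
```

Adding the difference coefficient to the reference slope recovers the second group's slope, which is the calculation the second ordering does for you.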
