You’ve probably experienced this before. You’ve done a statistical analysis, you’ve figured out all the steps, you finally get results and are able to interpret them. But they just look…wrong. Backwards, or even impossible—theoretically or logically.

This happened a few times recently to a couple of my consulting clients, and once to me. So I know that feeling of panic well. There are so many possible causes of incorrect results, but there are a few steps you can take that will help you figure out which one you’ve got and how (and whether) to correct it.

### Errors in Data Coding and Entry

In both of my clients’ cases, the problem was that they had coded missing data with an impossible and extreme value, like 99. But they failed to define that code as missing in SPSS. So SPSS took 99 as a real data point, which (more…)
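Here is a minimal sketch of how an undeclared missing-value code can distort even a simple summary. The scores are made up for illustration, and `MISSING_CODE` and `drop_missing` are hypothetical names. In SPSS itself you would declare the code as missing rather than filter it by hand.

```python
from statistics import mean

MISSING_CODE = 99  # hypothetical sentinel used to mark missing answers

# Made-up ratings on a 1-5 scale; two respondents did not answer.
ratings_raw = [4, 5, 99, 3, 99, 4]

def drop_missing(values, missing_code=MISSING_CODE):
    """Return only the values that are not the missing-value code."""
    return [v for v in values if v != missing_code]

# Treating 99 as a real data point pulls the mean far outside the 1-5 scale.
naive_mean = mean(ratings_raw)

# Excluding the code first gives a mean that is actually possible.
clean_mean = mean(drop_missing(ratings_raw))

print(naive_mean, clean_mean)
```

A result that falls outside the possible range of the scale is a good sign that a missing code slipped through.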

Someone who registered for my upcoming Interpreting (Even Tricky) Regression Models workshop asked if the content applies to logistic regression as well.

The short answer: **Yes**

The long-winded detailed explanation of why this is true and the one caveat:

One of the greatest things about regression models is that they all have the same setup: (more…)

I recently received this email, which I thought was a great question, and one of wider interest…

Hello Karen,

I am an MPH student in biostatistics and I am curious about using regression for tests of associations in applied statistical analysis. Why is using regression, or logistic regression “better” than doing bivariate analysis such as Chi-square?

I read a lot of studies in my graduate school studies, and it seems like half of the studies use Chi-Square to test for association between variables, and the other half, who just seem to be trying to be fancy, conduct some complicated regression-adjusted for-controlled by- model. But the end results seem to be the same. I have worked with some professionals that say simple is better, and that using Chi-Square is just fine, but I have worked with other professors that insist on building models. It also just seems so much more simple to do chi-square when you are doing primarily categorical analysis.

My professors don’t seem to be able to give me a simple justified answer, so I thought I’d ask you. I enjoy reading your site and plan to begin participating in your webinars.

Thank you!

(more…)

Most Multiple Imputation methods assume multivariate normality, so a common question is how to impute missing values from categorical variables.

Paul Allison, one of my favorite authors of statistical information for researchers, did a study that showed that the most common method actually gives worse results than listwise deletion. (Did I mention I’ve used it myself?) (more…)

Logistic regression models can seem pretty overwhelming to the uninitiated. Why not use a regular regression model? Just turn Y into an indicator variable–Y=1 for success and Y=0 for failure.

For some good reasons.

1. It doesn’t make sense to model Y as a linear function of the parameters because Y has only two values. You just can’t make a line out of that (at least not one that fits the data well).

2. The predicted values can be any positive or negative number, not just 0 or 1.

3. The values of 0 and 1 are arbitrary. The important part is not to predict the numerical value of Y, but the probability that success or failure occurs, and the extent to which that probability depends on the predictor variables.
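To make the second reason concrete, here is a small sketch that fits an ordinary least-squares line to a 0/1 outcome. The data (hours studied and pass/fail) are invented for illustration, and `fit_line` is a hypothetical helper, not a function from any statistics package.

```python
def fit_line(x, y):
    """Ordinary least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up data: hours studied and whether the student passed (1) or failed (0).
hours = [1, 2, 3, 4, 5, 6, 7, 8]
passed = [0, 0, 0, 1, 0, 1, 1, 1]

intercept, slope = fit_line(hours, passed)

# Predictions at the smallest and largest observed predictor values.
pred_at_1 = intercept + slope * 1
pred_at_8 = intercept + slope * 8

print(pred_at_1, pred_at_8)
```

With these numbers, the fitted line gives a prediction below 0 at one end of the observed data and above 1 at the other, neither of which can be read as a probability.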

So okay, you say. Why not use a simple transformation of Y, like the probability of success (the probability that Y=1)?

Well, that doesn’t work so well either.

Why not?

1. The right hand side of the equation can be any number, but the left hand side can only range from 0 to 1.

2. It turns out the relationship is not linear, but rather follows an S-shaped (or sigmoidal) curve.

To obtain a linear relationship, we need to transform this response too, Pr(success).

As luck would have it, there are a few functions that:

1. are not restricted to values between 0 and 1

2. will form a linear relationship with our parameters

These functions include:

•Arcsine

•Probit

•Logit

All three of these work just as well, but (believe it or not) the Logit function is the easiest to interpret.
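For reference, the logit is the log of the odds, log(p / (1 - p)). The sketch below shows that it stretches probabilities between 0 and 1 across the whole number line, and that it can be inverted to get back to a probability. The function names are my own choices for illustration.

```python
import math

def logit(p):
    """Log odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Map any real number back to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-x))

for p in (0.001, 0.2, 0.5, 0.8, 0.999):
    x = logit(p)
    # The round trip through inv_logit recovers the original probability.
    print(f"p = {p:<5}  logit = {x:7.3f}  back to p = {inv_logit(x):.3f}")
```

Probabilities near 0 map to large negative numbers, probabilities near 1 map to large positive numbers, and 0.5 maps to exactly 0.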

But as it turns out, you can’t just run the transformation and then do a regular linear regression on the transformed data. That would be way too easy, and it would also give inaccurate results. Logistic regression uses a different method for estimating the parameters (maximum likelihood), which gives better results (better meaning unbiased, with lower variances).
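In most software, that different estimation method is maximum likelihood. As a rough sketch of the idea, and not how any particular package implements it, here is a Newton-Raphson fit for a model with one predictor. The data are made up, and `fit_logistic` and `score` are hypothetical names.

```python
import math

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

def score(b0, b1, x, y):
    """Gradient of the log-likelihood with respect to (b0, b1)."""
    g0 = sum(yi - inv_logit(b0 + b1 * xi) for xi, yi in zip(x, y))
    g1 = sum((yi - inv_logit(b0 + b1 * xi)) * xi for xi, yi in zip(x, y))
    return g0, g1

def fit_logistic(x, y, iterations=25):
    """Maximum likelihood fit of logit(Pr(y=1)) = b0 + b1*x by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0, g1 = score(b0, b1, x, y)
        # Information matrix entries (negative Hessian of the log-likelihood).
        h00 = h01 = h11 = 0.0
        for xi in x:
            p = inv_logit(b0 + b1 * xi)
            w = p * (1 - p)
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * step = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Made-up data: hours studied and pass (1) / fail (0).
hours = [1, 2, 3, 4, 5, 6, 7, 8]
passed = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(hours, passed)
fitted = [inv_logit(b0 + b1 * h) for h in hours]
print(b0, b1)
```

Notice that the 0/1 outcomes are never transformed directly (the logit of 0 or 1 is undefined). The algorithm searches for the coefficients that make the observed outcomes most likely, and every fitted probability stays between 0 and 1.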

When the dependent variable in a regression model is a proportion or a percentage, it can be tricky to decide on the appropriate way to model it.

The big problem with ordinary linear regression is that the model can predict values that aren’t possible–values below 0 or above 1. But the other problem is that the relationship isn’t linear–it’s sigmoidal. A sigmoidal curve looks like a flattened S–linear in the middle, but flattened on the ends. So now what?

The simplest approach is to do a linear regression anyway. This approach can be justified only in a few situations.

1. All your data fall in the middle, linear section of the curve. This generally translates to all your data being between .2 and .8 (although I’ve heard that between .3-.7 is better). If this holds, you don’t have to worry about the two objections. You do have a linear relationship, and you won’t get predicted values much beyond those values–certainly not beyond 0 or 1.

2. It is a really complicated model that would be much harder to model another way. If you can assume a linear model, it will be much easier to do, say, a complicated mixed model or a structural equation model. If it’s just a single multiple regression, however, you should look into one of the other methods.
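To get a feel for why the middle of the curve is the safe zone, the sketch below compares the logistic curve with a straight line that touches it at its midpoint. The tangent line is my own stand-in for "a linear fit" and is used only for illustration; a fitted regression line would differ somewhat.

```python
import math

def logit(p):
    """Log odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1 - p))

def tangent_line(x):
    """Line touching the logistic curve at (0, 0.5), where the curve's slope is 1/4."""
    return 0.5 + x / 4

def line_error(p):
    """Gap between the straight line and the true proportion p."""
    return tangent_line(logit(p)) - p

for p in (0.5, 0.3, 0.2, 0.05):
    print(f"true p = {p:.2f}  line gives {tangent_line(logit(p)):.3f}")
```

The line tracks the curve closely for proportions around .3 to .7, drifts a bit by .2, and by .05 it has already dropped below 0.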

A second approach is to treat the proportion as a binary response and then run a logistic or probit regression. This will only work if the proportion can be thought of as a number of successes out of a number of trials, and you have the data for both the number of successes and the total number of trials. For example, the proportion of land area covered with a certain species of plant would be hard to think of this way, but the proportion of correct answers on a 20-answer assessment would.
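As a small illustration of what "thinking of it this way" means, the sketch below turns a count of successes and trials into the individual 0/1 responses that a logistic or probit regression models. The numbers and the helper name `to_binary_rows` are made up. Many packages also accept the successes and trials directly, so this expansion is only meant to show the idea.

```python
def to_binary_rows(successes, trials):
    """Expand a count of successes out of trials into individual 0/1 responses."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    return [1] * successes + [0] * (trials - successes)

# Made-up example: 15 correct answers on a 20-answer assessment.
rows = to_binary_rows(15, 20)
proportion = sum(rows) / len(rows)

print(len(rows), proportion)
```

There is no way to do the same thing for the proportion of land covered by a plant, because there are no individual trials to count.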

The third approach is to treat the proportion as a censored continuous variable. The censoring means that you don’t have information below 0 or above 1. For example, perhaps the plant would spread even more if it hadn’t run out of land. If you take this approach, you would run the model as a two-limit tobit model (Long, 1997). This approach works best if there isn’t an excessive amount of censoring (values of 0 and 1).
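Here is a rough sketch of the log-likelihood that a two-limit tobit model maximizes, assuming normally distributed errors. Observations at the limits contribute the probability that the underlying value falls at or beyond the limit, and observations in between contribute the usual normal density. The function and argument names are my own, and real software handles the details and the optimization for you.

```python
import math

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def two_limit_tobit_loglik(y, mu, sigma, lower=0.0, upper=1.0):
    """Log-likelihood of observed y given linear predictions mu and error SD sigma."""
    total = 0.0
    for yi, mi in zip(y, mu):
        if yi <= lower:
            # Censored from below: underlying value is at or below the lower limit.
            total += math.log(norm_cdf((lower - mi) / sigma))
        elif yi >= upper:
            # Censored from above: underlying value is at or above the upper limit.
            total += math.log(norm_cdf((mi - upper) / sigma))
        else:
            # Uncensored: ordinary normal density of the residual.
            total += math.log(norm_pdf((yi - mi) / sigma) / sigma)
    return total

# Made-up proportions and predictions, just to show the call.
print(two_limit_tobit_loglik([0.0, 0.4, 1.0], [0.1, 0.5, 0.9], sigma=0.2))
```

When many observations sit at 0 or 1, most of the likelihood comes from the censored terms, which is one way to see why heavy censoring leaves the model with less to work with.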

Reference: Long, J.S. (1997). Regression Models for Categorical and Limited Dependent Variables. Sage Publishing.