You’ve probably experienced this before. You’ve done a statistical analysis, you’ve figured out all the steps, you finally get results and are able to interpret them. But they just look…wrong. Backwards, or even impossible—theoretically or logically.

This happened a few times recently to a couple of my consulting clients, and once to me. So I know that feeling of panic well. There are so many possible causes of incorrect results, but there are a few steps you can take that will help you figure out which one you’ve got and how (and whether) to correct it.

**Errors in Data Coding and Entry**

In both of my clients’ cases, the problem was that they had coded missing data with an impossible and extreme value, like 99. But they failed to define that code as missing in SPSS. So SPSS took 99 as a real data point, which overwhelmed the analysis and created a seemingly negative relationship that shouldn’t be there.
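To make the effect concrete, here is a minimal sketch in Python with NumPy (not SPSS, which my clients were using). The variable names and numbers are made up for illustration; the only assumption carried over from above is that 99 stands for "missing."

```python
import numpy as np

MISSING_CODE = 99  # hypothetical sentinel value used for missing data

# Made-up data: hours of practice and a test score on a 1-10 scale.
# The first score was never collected and was entered as 99.
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
score = np.array([99, 3, 3, 4, 5, 6, 7, 8], dtype=float)

# Correlation with the sentinel treated as a real score
raw_r = np.corrcoef(hours, score)[0, 1]

# Declare the sentinel as missing, then drop that case before correlating
clean = np.where(score == MISSING_CODE, np.nan, score)
keep = ~np.isnan(clean)
clean_r = np.corrcoef(hours[keep], clean[keep])[0, 1]

print(f"r with 99 left in:       {raw_r:.2f}")
print(f"r with 99 set to missing: {clean_r:.2f}")
```

A single undeclared 99 is enough to turn a strongly positive relationship negative in this toy data set.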

Likewise, plain old data entry mistakes, like that observation of the child in 28th grade, or more mundane coding errors, like forgetting to reverse code the negatively-worded items on a scale, are quite common. They may or may not be outliers in a univariate sense, but they’ll often be in the wrong place if you look at bivariate graphs.
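Both kinds of error are easy to check for in code. The sketch below assumes a hypothetical 1-to-5 response scale and a hypothetical valid grade range of 0 to 12; substitute whatever ranges apply to your own variables.

```python
import numpy as np

# Reverse coding: on a scale running from SCALE_MIN to SCALE_MAX,
# the reversed value is (SCALE_MIN + SCALE_MAX) - original.
SCALE_MIN, SCALE_MAX = 1, 5            # assumed scale endpoints
item = np.array([1, 2, 5, 4, 3])       # made-up responses to a negatively worded item
reversed_item = (SCALE_MIN + SCALE_MAX) - item

# Range check: flag values that cannot occur, like a child in 28th grade.
GRADE_MIN, GRADE_MAX = 0, 12           # assumed valid range
grades = np.array([3, 5, 28, 7, 12])   # made-up data with one entry error
impossible = (grades < GRADE_MIN) | (grades > GRADE_MAX)

print("reversed item:", reversed_item)
print("rows to inspect:", np.flatnonzero(impossible))
```

A range check only catches values that are impossible on their own. An entry that is within range but wrong for that case will still need a bivariate graph to spot.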

**Misinterpretations**

Sometimes, the results aren’t wrong—you’re making mistakes in reading them. Read the output carefully, and check defaults in your software’s manual. Did it do what you think it did? Even in the same statistical package, the defaults differ across procedures.

This is what happened in my recent mixup. I was running a logistic regression in a procedure I don’t use often—GEE. The default reference group for the binary outcome variable was opposite what I’m used to. The results weren’t wrong—I just misinterpreted them.
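Here is a small illustration of why the reference group matters. It is not GEE; it uses the fact that with one binary predictor, the logistic regression coefficient equals the log odds ratio from the 2x2 table. The counts and the function name are invented for the example.

```python
import math

def log_odds_ratio(ref_unexposed, event_unexposed, ref_exposed, event_exposed):
    """Log odds ratio of the 'event' category, exposed vs. unexposed."""
    odds_exposed = event_exposed / ref_exposed
    odds_unexposed = event_unexposed / ref_unexposed
    return math.log(odds_exposed / odds_unexposed)

# Made-up counts of the outcome (y = 0 or y = 1) in each exposure group
unexposed_y0, unexposed_y1 = 30, 10
exposed_y0, exposed_y1 = 15, 25

# Software that models P(y = 1): y = 0 is the reference category
b_models_y1 = log_odds_ratio(unexposed_y0, unexposed_y1, exposed_y0, exposed_y1)

# Software that models P(y = 0): y = 1 is the reference category
b_models_y0 = log_odds_ratio(unexposed_y1, unexposed_y0, exposed_y1, exposed_y0)

print(f"coefficient when modeling y = 1: {b_models_y1:+.3f}")
print(f"coefficient when modeling y = 0: {b_models_y0:+.3f}")
```

The two coefficients have the same size and opposite signs. Neither is wrong, but reading one as if it were the other gives exactly the backwards conclusion.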

Not all misinterpretations come from software defaults. Some come from the way the statistics are calculated. Regression coefficients are particularly tricky here because they change meaning depending on the other terms in the model. For example, once a regression model includes an interaction term, the coefficients of the component terms are NOT main effects, as they would be without the interaction. Each one is the effect of that variable when the other variable in the interaction equals zero. So including an interaction can easily reverse or otherwise drastically change what looks like the same coefficient.
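A noise-free toy example shows the reversal. The data and the true coefficients below are invented, and the `fit` helper is just ordinary least squares via NumPy, not any particular package's regression routine.

```python
import numpy as np

# Balanced grid of made-up predictor values; note that z is never near 0
x, z = np.meshgrid(np.arange(0, 5, dtype=float), np.arange(4, 9, dtype=float))
x, z = x.ravel(), z.ravel()

# Invented "true" model: the effect of x is -2 + 0.5 * z
y = 1.0 - 2.0 * x + 0.3 * z + 0.5 * x * z

def fit(*columns):
    """Ordinary least squares with an intercept; returns the coefficients."""
    design = np.column_stack([np.ones_like(y), *columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

b_main = fit(x, z)                 # no interaction term
b_inter = fit(x, z, x * z)         # interaction with raw z
zc = z - z.mean()
b_centered = fit(x, zc, x * zc)    # interaction with mean-centered z

print(f"x coefficient, no interaction:        {b_main[1]:+.2f}")
print(f"x coefficient, with interaction:      {b_inter[1]:+.2f}")
print(f"x coefficient, interaction, z centered: {b_centered[1]:+.2f}")
```

With the interaction in the model, the x coefficient is the effect of x at z = 0, a value that never occurs in these data. Centering z moves that reference point to the mean of z, which is often the more interpretable choice.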

**Misspecifying the Model**

It may be that the model you proposed isn’t the best model for the data. The results look strange because they’re not very accurate. There may exist effects you didn’t include, like interactions, non-linear effects, or important control variables. Or a different type of model may be more appropriate—a Poisson, a mixed model—for your design and variables.

It’s not cheating or data fishing to treat your first model as a starting point, then refine it. Fishing is trying every possible thing you can to get a significant result. Good data analysis is refining a model to improve the fit in a way that takes theory, logic, and model assumptions into account.

**Bigger Data Issues**

A bigger problem with missing data is that the default in most software is to drop any case with any missing data. In multivariate analyses, this sometimes results in a lot of data getting dropped even if the percentage of missing data is small on any one variable. Depending on the patterns of missing data, it can cause analysis results to be totally off.
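The arithmetic is worth seeing once. This sketch builds a made-up data set in which each of five variables is missing for only 10% of cases, with the missing values deliberately placed on different cases (the worst case for listwise deletion).

```python
import numpy as np

n_rows, n_vars = 100, 5
data = np.ones((n_rows, n_vars))

# Each variable is missing on a different block of 10 cases
for j in range(n_vars):
    data[j * 10:(j + 1) * 10, j] = np.nan

missing_per_variable = np.isnan(data).mean(axis=0)   # share missing, by variable
complete_case = ~np.isnan(data).any(axis=1)          # rows with no missing values

print("share missing per variable:", missing_per_variable)
print("cases kept after listwise deletion:", int(complete_case.sum()), "of", n_rows)
```

No variable is missing more than 10% of its values, yet half the sample disappears from any analysis that uses all five variables. When the missing values overlap on the same cases, the loss is smaller, which is why the pattern matters as much as the percentage.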

Another possibility is severe multicollinearity. One of its consequences is that it causes regression coefficients to go wonky (yes, that’s a technical term). Specifically, they’ll reverse sign or get very extreme, their standard errors will be huge, and p-values will be close or equal to 1. So if these three things happen together, multicollinearity is a likely cause. But remember, extreme situations like severe multicollinearity are dramatic, but they’re also the least common. Data coding errors and missing data are much more likely.
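One common diagnostic is the variance inflation factor (VIF), which for each predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes it by hand with NumPy on simulated data; the `vif` helper and the variables are illustrative, and many statistical packages report the same quantity directly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated predictors: x2 is almost a copy of x1, x3 is unrelated
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """Variance inflation factor for column j of the predictor matrix X."""
    target = X[:, j]
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    coef, *_ = np.linalg.lstsq(others, target, rcond=None)
    resid = target - others @ coef
    r_squared = 1 - resid.var() / target.var()
    return 1 / (1 - r_squared)

vifs = [vif(X, j) for j in range(X.shape[1])]
for name, value in zip(["x1", "x2", "x3"], vifs):
    print(f"VIF for {name}: {value:.1f}")
```

A frequently quoted rule of thumb treats a VIF above about 10 as a sign of trouble, though that cutoff is a convention rather than a test. Here x1 and x2 land far above it, while x3 stays near 1.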

**The Steps**

How do you figure out which problem you have? Just by doing the steps you *ought* to be doing anyway, whether the results look wrong or not. Chances are, you’ve skipped one or more of these steps:

1. Run univariate and bivariate descriptive statistics and graphs. Most coding errors and some model misspecifications will show up clearly.

2. Read output and syntax carefully. Check and recheck the defaults and how your software refers to the output. Make sure you know what each piece of output really means.

3. Check model assumptions and diagnose data issues like multicollinearity and missing data. Most model misspecifications will appear in model diagnostics.

And finally, consider the possibility that the unexpected result is correct. If you’ve gone through all the diagnoses thoroughly and you can be confident there aren’t any errors, accept the unexpected results. They’re often more interesting.

**1 Comment**

I am doing a thesis on insecticide-treated net utilization among under-five children and household net ownership among households with under-five children.

The problem: I asked about all children in each household in ascending order and entered the data in SPSS as follows.

Did the first child sleep under a net? 0=no, 1=yes

Did the second child sleep under a net? 0=no, 1=yes

Did the third child sleep under a net? 0=no, 1=yes

I also recorded age and sex for each child. It has become difficult to decide which one to use as the dependent variable. Is there a way to combine them?