You’ve probably experienced this before. You’ve done a statistical analysis, you’ve figured out all the steps, you finally get results and are able to interpret them. But the statistical results just look…wrong. Backwards, or even impossible—theoretically or logically.
This happened a few times recently to a couple of my consulting clients, and once to me. So I know that feeling of panic well. There are so many possible causes of incorrect results, but there are a few steps you can take that will help you figure out which one you’ve got and how (and whether) to correct it.
Errors in Data Coding and Entry
In both of my clients’ cases, the problem was that they had coded missing data with an impossible and extreme value, like 99. But they failed to define that code as missing in SPSS. So SPSS took 99 as a real data point, which (more…)
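To see how much damage an undefined missing-value code can do, here's a minimal sketch in Python with pandas (rather than SPSS, and with a made-up age variable) showing both the distortion and the fix:

```python
import numpy as np
import pandas as pd

# Hypothetical ages for 8 people, with 99 used as the missing-data code
age = pd.Series([34, 41, 99, 28, 36, 99, 45, 31])

# If 99 is treated as a real data point, the statistics are badly distorted
print(age.mean())  # 51.6 -- inflated by the two 99s

# Defining the code as missing (the analogue of declaring a missing
# value in SPSS) restores sensible results
age_clean = age.replace(99, np.nan)
print(age_clean.mean())  # 35.8 -- computed from the six real values only
```

The same two 99s that inflate the mean will also inflate the variance and distort every correlation and regression coefficient that variable touches.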
One of the biggest questions I get is about the difference between mediators and moderators, and how they both differ from control variables.
I recently found a fabulous free video tutorial on the difference between mediators, moderators, and suppressor variables, by Jeremy Taylor at Stats Make Me Cry. The witty example is about the different types of variables (talent, practice, etc.) that explain the relationship between having a guitar and making lots of $$.
While regression models do make a number of distributional assumptions, there is no assumption at all about the distribution of the predictor (i.e. independent) variables.
This is because regression models are directional. In a correlation there is no direction: Y and X are interchangeable, and if you switched them you’d get the same correlation coefficient.
But regression is inherently a model about the outcome variable. What predicts its value and how well? The nature of how predictors relate to it (more…)
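Here’s a small simulated demonstration of that asymmetry in Python (using numpy; the data are made up): the correlation is identical in both directions, but the regression slope depends on which variable is the outcome.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)  # simulated outcome

# Correlation is symmetric: swapping X and Y changes nothing
print(np.corrcoef(x, y)[0, 1])
print(np.corrcoef(y, x)[0, 1])  # identical value

# Regression is directional: the slope of Y on X is not the reciprocal
# of the slope of X on Y (they'd match only if r were exactly 1 or -1)
slope_y_on_x = np.polyfit(x, y, 1)[0]
slope_x_on_y = np.polyfit(y, x, 1)[0]
print(slope_y_on_x, 1 / slope_x_on_y)  # two different numbers
```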
Here’s a little tip.
When you construct dummy variables, make it easy on yourself to remember which code is which. Heck, if you want to be really nice, make it easy for anyone else who will analyze the data or read the results.
Make the codes inherent in the dummy variable name.
So instead of a variable named Gender with values of 1=Female and 0=Male, call the variable Female.
Instead of a set of dummy variables named MaritalStatus1 with values of 1=Married and 0=Single, along with MaritalStatus2 with values 1=Divorced and 0=Single, name the same variables Married and Divorced.
And if you’re new to dummy coding, this has the extra bonus of making the dummy coding intuitive. It’s just a set of yes/no variables about all but one of your categories.
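If you’re working in code, the same naming convention is easy to apply. A quick sketch in Python with pandas (the data and variable names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male"],
    "marital_status": ["Married", "Single", "Divorced", "Married"],
})

# Name each dummy for the category coded 1, so the coding is self-documenting:
# Female = 1 means female; Married = 1 means married; and so on.
# Single is the reference category (0 on both marital-status dummies).
df["Female"] = (df["gender"] == "Female").astype(int)
df["Married"] = (df["marital_status"] == "Married").astype(int)
df["Divorced"] = (df["marital_status"] == "Divorced").astype(int)

print(df[["Female", "Married", "Divorced"]])
```

Anyone reading the output, or the regression coefficients later, can see immediately what a 1 means on each variable.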
I know you know it: those assumptions in your regression or ANOVA model really are important. If they’re not met adequately, all your p-values are inaccurate, wrong, useless.
But, and this is a big one, linear models are robust to departures from those assumptions. Meaning, they don’t have to fit exactly for p-values to be accurate, right, and useful.
You’ve probably heard both of these contradictory statements in stats classes and a million other places, and they are the kinds of statements that drive you crazy. Right?
I mean, do statisticians make this stuff up just to torture researchers? Or just to keep you feeling stupid?
No, they really don’t. (I promise!) And learning how far you can push those assumptions before the robustness gives out isn’t so hard, with some training and a little practice. Over the years, I’ve found a few mistakes researchers commonly make because of one, or both, of these statements:
1. They worry too much about the assumptions and over-test them. There are some handy statistical tests to determine whether your assumptions are met. And it’s so nice having a p-value, right? Then it’s clear what you’re supposed to do, based on that golden rule of p<.05.
The only problem is that many of these tests ignore that robustness. They find that every distribution is non-normal and every error variance heteroskedastic. They’re good tools, but to these hammers, every data set looks like a nail. Use the hammer when it’s needed, but don’t hammer everything.
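To see that over-sensitivity in action, here’s a simulated sketch in Python with scipy: a t-distribution with 20 degrees of freedom is only trivially different from a normal distribution, exactly the kind of departure linear models shrug off, yet with a large sample a formal normality test confidently rejects it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated "residuals" from a t-distribution with 20 df:
# visually indistinguishable from normal (excess kurtosis of only 0.375)
residuals = rng.standard_t(df=20, size=50_000)

# With n = 50,000, the D'Agostino-Pearson normality test flags
# even this trivial departure as highly "significant"
stat, p = stats.normaltest(residuals)
print(p)  # typically far below .05
```

The p-value tells you the departure is detectable, not that it matters.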
2. They assume everything is robust anyway, so they don’t test anything. It’s easy to do. And once again, it probably works out much of the time. Except when it doesn’t.
Yes, the GLM is robust to deviations from some of the assumptions. But not all the way, and not all the assumptions. You do have to check them.
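For the checking itself, graphical diagnostics are usually more informative than formal tests, because they show you the size and shape of any violation instead of just a p-value. A minimal sketch in Python with statsmodels and matplotlib, on simulated data:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=200)
y = 3 + 2 * x + rng.normal(scale=2, size=200)  # simulated data

model = sm.OLS(y, sm.add_constant(x)).fit()

# Residuals vs. fitted values: look for curvature (non-linearity)
# or a fan shape (heteroskedasticity)
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, color="gray")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Q-Q plot of the residuals: look for systematic departures in the tails
sm.qqplot(model.resid, line="45", fit=True)
plt.show()
```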
3. They test the wrong assumptions. Open any two regression books and you’ll find two different sets of assumptions.
This is partly because many of these “assumptions” do need to be checked, but they’re not really model assumptions; they’re data issues. It’s also partly because sometimes the assumptions have been taken to their logical conclusions. The textbook author is trying to make them more intuitive for you, but sometimes that just leads you to test something related, but wrong. It works out most of the time, but not always.
In many research fields, it’s common practice to categorize continuous predictor variables so they fit into an ANOVA. This is often done with a median split, which divides the sample into two categories: the “high” values above the median and the “low” values below it.
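For concreteness, here’s what a median split looks like in Python with pandas (the anxiety scores are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
anxiety = pd.Series(rng.normal(50, 10, size=100))  # hypothetical continuous predictor

# Median split: above the median is "high", at or below is "low"
median = anxiety.median()
anxiety_group = np.where(anxiety > median, "high", "low")

# Note: two people with scores of, say, 49.9 and 50.1 land in different
# groups, while 50.1 and 79.0 land in the same one
print(pd.Series(anxiety_group).value_counts())
```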
Reasons Not to Categorize a Continuous Predictor
There are many reasons why this isn’t such a good idea: (more…)