When your dependent variable is not continuous, unbounded, and measured on an interval or ratio scale, your model will never meet the assumptions of the General Linear Model (GLM). Today I'm going to go into more detail about six common types of dependent variables that violate these assumptions, and the models that work instead.
Categorical Variables, including both binary (with 2 values) and multicategory (with 3 or more values), clearly fail all three criteria. But a number of regression models beyond the GLM, such as binary and multinomial logistic regression, do fit these variables.
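For instance, here's a minimal sketch of a binary logistic regression in Python using statsmodels. The data are simulated, and the variable names (hours_studied, passed) are just illustrative stand-ins:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
df = pd.DataFrame({"hours_studied": rng.uniform(0, 10, 200)})

# Simulate a binary pass/fail outcome from a known logistic model
p = 1 / (1 + np.exp(-(-2 + 0.5 * df["hours_studied"])))
df["passed"] = rng.binomial(1, p)

# Binary logistic regression: models the log-odds of passing
model = smf.logit("passed ~ hours_studied", data=df).fit()
print(model.summary())
```

For a multicategory outcome, multinomial logistic regression (smf.mnlogit) works the same way.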
Ordinal Variables are ordered categories. They include rank and Likert-item variables, though they are not limited to these. Although ordinal variables look like numbers, the distances between their values aren't equal in a true numerical sense, so it doesn't make sense to apply numerical operations, like addition and division, to them. Like unordered categorical variables, ordinal variables require specialized logistic or probit models, such as the proportional odds model.
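If you want to try a proportional odds model, statsmodels has an OrderedModel class for exactly this. Here's a sketch on simulated data; the three-category rating is made up for illustration:

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(1)
n = 300
x = pd.DataFrame({"x": rng.normal(size=n)})

# Cut a latent score into three ordered categories: low < medium < high
latent = 1.5 * x["x"] + rng.logistic(size=n)
rating = pd.Series(pd.cut(latent, bins=[-np.inf, -1, 1, np.inf],
                          labels=["low", "medium", "high"]))

# Proportional odds (ordinal logistic) model
model = OrderedModel(rating, x, distr="logit").fit(method="bfgs")
print(model.summary())
```

Swapping distr="logit" for distr="probit" gives the ordinal probit version of the same model.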
Discrete counts fail the assumptions of a GLM for many reasons. The most obvious is that the normal distribution of a GLM allows any value on the real number line, but counts are bounded at 0. It just doesn't make sense to predict negative numbers of cigarettes smoked each day, children in a family, or aggressive incidents.
But Poisson regression, or one of its brethren (negative binomial regression, for example), is designed to accurately model count data.
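Here's a quick Poisson regression sketch, again with simulated data and made-up variable names:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
df = pd.DataFrame({"stress": rng.normal(size=250)})

# Simulate counts whose log-mean rises with stress
df["incidents"] = rng.poisson(np.exp(0.3 + 0.6 * df["stress"]))

# Poisson regression: models the log of the expected count
model = smf.poisson("incidents ~ stress", data=df).fit()
print(model.summary())
```

If your counts are more spread out than a Poisson allows (overdispersion), negative binomial regression (smf.negativebinomial) is the usual next step.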
Zero-Inflated data have a huge spike in the distribution at 0. They are common in count data, but can occur with any distribution. A recent example I saw was scores on a depression scale. The scale ran from 0 to 20, and 0 was by far the most common value (which is a good thing for the state of humanity, but it really messes up the GLM). Even if the rest of the distribution is normal, you can't transform zero-inflated data to look normal. A Zero-Inflated model, however, incorporates the high number of zeros by simultaneously modeling 0/Not 0 as a logistic regression and all the Not 0 values as another distribution. It's pretty cool, actually.
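Here's a sketch of what that joint model looks like using statsmodels' ZeroInflatedPoisson class, on simulated data with an artificial spike at 0:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedPoisson

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
X = sm.add_constant(x)

# Mix a point mass at 0 with ordinary Poisson counts
extra_zero = rng.binomial(1, 0.4, size=n)
counts = rng.poisson(np.exp(0.5 + 0.4 * x))
y = np.where(extra_zero == 1, 0, counts)

# Fits the logistic (0 / Not 0) part and the Poisson count part jointly;
# here the inflation part is intercept-only
model = ZeroInflatedPoisson(y, X, exog_infl=np.ones((n, 1)),
                            inflation="logit").fit()
print(model.summary())
```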
Censored or truncated data have full information about the values of the DV only for part of its range; the distribution gets cut off, often at one end. One example is a survey that records exact income for everyone up to $200k, but beyond that, everyone is just given "over $200k." In surveys, this is done for privacy reasons: there just aren't many people with such high incomes. But sometimes it's simply a measurement issue. Tobit regression models are designed to handle the imprecise measurement on some parts of the scale.
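statsmodels doesn't ship a built-in Tobit model, so here's a hand-rolled sketch of a right-censored Tobit likelihood built on its GenericLikelihoodModel class. The data and the censoring point are simulated, echoing the income example above:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.base.model import GenericLikelihoodModel

class Tobit(GenericLikelihoodModel):
    """Right-censored Tobit: values at or above `cutoff` are censored."""

    def __init__(self, endog, exog, cutoff, **kwargs):
        super().__init__(endog, exog, **kwargs)
        self.cutoff = cutoff

    def loglike(self, params):
        beta, sigma = params[:-1], np.abs(params[-1])
        mu = self.exog @ beta
        censored = self.endog >= self.cutoff
        # Exact normal density for observed values,
        # survival probability for censored ones
        ll = np.where(censored,
                      stats.norm.logsf(self.cutoff, loc=mu, scale=sigma),
                      stats.norm.logpdf(self.endog, loc=mu, scale=sigma))
        return ll.sum()

rng = np.random.default_rng(5)
n = 400
x = rng.normal(size=n)
X = sm.add_constant(x)
income = 50 + 20 * x + rng.normal(scale=10, size=n)  # latent "true" income
y = np.minimum(income, 80)                           # recorded as "80+" above 80

# params are [intercept, slope, sigma]
result = Tobit(y, X, cutoff=80).fit(start_params=[40.0, 10.0, 15.0])
print(result.params)
```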
Proportion data, bounded at 0 and 1, or percentage data, bounded at 0 and 100, really become problematic if much of the data are close to the bounds. If all the data fall in the middle portion, say in the .2 to .8 range, a GLM can give reasonably good results. But beyond that, you need to use either a probit or logistic regression if the proportion measures discrete events (proportion of questions answered correctly) or a Tobit regression if the proportion measures a continuous entity (proportion of time spent studying).
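For the discrete-events case, one common approach is a fractional logit: a GLM with a binomial family and logit link, fit directly to the proportions. Here's a sketch on simulated data, with made-up variable names:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
df = pd.DataFrame({"study_hours": rng.uniform(0, 12, 200)})

# Simulate a proportion-correct outcome that piles up near the bounds
eta = -2 + 0.5 * df["study_hours"] + rng.normal(scale=0.8, size=200)
df["prop_correct"] = 1 / (1 + np.exp(-eta))

# Fractional logit: a binomial-family GLM fit directly to proportions in [0, 1]
model = smf.glm("prop_correct ~ study_hours", data=df,
                family=sm.families.Binomial()).fit()
print(model.summary())
```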