When your dependent variable is not continuous, unbounded, and measured on an interval or ratio scale, your model will not meet the assumptions of linear models.
Today I’m going to go into more detail about six common types of dependent variables that are not continuous, unbounded, and measured on an interval or ratio scale, and the models that work instead.
Side note: the usual advice is to use nonparametric tests when normality assumptions fail. That works when you’re doing something simple, like a correlation or comparing group means. But if you’re including covariates or interactions in a model, you need a real model.
Categorical Variables

Both binary (2 values) and multicategory (3 or more values) variables clearly fail all three criteria. But there are other types of regression models, such as binary and multinomial logistic regression, that work just fine for these variables.
Ordinal Variables

These variables are made up of ordered categories. They include rank and Likert-item variables, though they are not limited to these.
Although ordinal variables look like numbers, the distances between their values aren’t equal in a true numerical sense, so it doesn’t make sense to apply numerical operations, like addition and division, to them. Hence means, the basis of linear models, don’t really compute.
Like unordered categorical variables, ordinal variables require specialized logistic or probit models, such as the proportional odds model. There are a few other types of ordinal models, but the proportional odds model is most commonly available.
Count Variables

Discrete counts fail the assumptions of linear models for many reasons. The most obvious is that the normal distribution of linear models allows any value on the number line, but counts are bounded at 0. It just doesn’t make sense to predict negative numbers of cigarettes smoked each day, children in a family, or aggressive incidents.
But Poisson regression, and related models like the negative binomial, are designed to model count data accurately.
Zero Inflated Variables
Zero Inflated data have a spike in the distribution at 0.
They are common in Poisson data, but can occur with any distribution. A recent example I saw was scores on a depression scale. The scale ran from 0 to 20, and 0 was by far the most common value (which is a good thing for the state of humanity, but it really messes up the linear model assumptions).
Even if the rest of the distribution is normal, you can’t transform zero-inflated data to look normal. A zero-inflated model, however, accommodates the excess zeros by simultaneously modeling 0/not-0 as a logistic regression and the not-0 values with another distribution. It’s pretty cool, actually.
Censored Variables

Censored data contain full information about the dependent variable for only part of its range. The distribution gets cut off, often at one end.
Examples include surveys that record exact income for everyone up to $200k, but beyond that, code everyone simply as “over $200k.” In surveys, this is done for privacy: there just aren’t many people with such high incomes, so exact values could identify them.
But sometimes censoring is just a measurement issue. Tobit regression models are designed to handle the imprecise measurement on some parts of the scale.
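To my knowledge statsmodels doesn’t ship a Tobit model, but the idea behind it—maximizing a censored-normal likelihood—is short enough to sketch directly with scipy. Everything below, including the $200k-style censoring cap, is simulated for illustration:

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(4)

# Simulated income-style data, right-censored at a cap (illustrative).
n = 400
x = rng.normal(size=n)
y_star = 1.0 + 0.8 * x + rng.normal(scale=0.5, size=n)  # true latent DV
cap = 1.5
y = np.minimum(y_star, cap)        # we only observe values up to the cap
censored = y >= cap                # capped observations are flagged

def neg_loglik(theta):
    """Negative log-likelihood of a right-censored (Tobit-style) normal model."""
    b0, b1, log_sigma = theta
    sigma = np.exp(log_sigma)      # keep sigma positive
    mu = b0 + b1 * x
    ll = np.where(
        censored,
        stats.norm.logsf(cap, loc=mu, scale=sigma),  # P(latent value >= cap)
        stats.norm.logpdf(y, loc=mu, scale=sigma),   # exact observed values
    )
    return -ll.sum()

res = optimize.minimize(neg_loglik, x0=np.zeros(3), method="BFGS")
b0_hat, b1_hat, _ = res.x
```

The key move is in the `np.where`: exact values contribute a density, while censored values contribute only the probability of exceeding the cap, which is all we actually know about them.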
Proportions

Proportions, bounded at 0 and 1, or percentages, bounded at 0 and 100, really become problematic if much of the data are close to the bounds.
If all the data fall in the middle portion, say in the .2 to .8 range, a linear model can give reasonably good results. But beyond that, you need either beta regression, if the proportion is continuous, or logistic regression, if the proportion counts discrete events with a binary outcome (e.g., the proportion of questions answered correctly).