Last month I did a webinar on Poisson and negative binomial models for count data. With a few hundred participants, we ran out of time to get through all the questions, so I’m answering some of them here on the blog.
These questions all relate to when it’s appropriate to treat count data as continuous and run the simpler, more familiar linear model.
Q: Do you have any guidelines or rules of thumb as far as how many discrete values an outcome variable can take on before it makes more sense to just treat it as continuous?
The issue usually isn’t a matter of how many values there are.
Need to dummy code in a Cox regression model?
Interpret interactions in a logistic regression?
Add a quadratic term to a multilevel model?
This is where statistical analysis starts to feel really hard. You’re combining two difficult issues into one.
You’re dealing with both a complicated modeling technique at Stage 3 (survival analysis, logistic regression, multilevel modeling) and tricky effects in the model (dummy coding, interactions, and quadratic terms).
The only way to figure it all out in a situation like that is to break it down into parts.
1. For a general overview of modeling count variables, you can get free access to the video recording of one of The Craft of Statistical Analysis webinars:
Poisson and Negative Binomial for Count Outcomes
2. One of my favorite books on categorical data analysis is:
Long, J. Scott. (1997). Regression Models for Categorical and Limited Dependent Variables. Sage Publications.
It’s moderately technical, but written with social science researchers in mind. It’s so well written, it’s worth it. It has a section specifically about Zero Inflated Poisson and Zero Inflated Negative Binomial regression models.
3. Slightly less technical, but useful mainly if you use Stata, is Regression Models for Categorical Dependent Variables Using Stata, by J. Scott Long and Jeremy Freese.
4. UCLA’s ATS Statistical Software Consulting Group has some nice examples of Zero-Inflated Poisson and other models in various software packages.
There are quite a few types of outcome variables that will never meet the ordinary linear model’s assumption of normally distributed residuals. A non-normal outcome variable can have normally distributed residuals, but it does need to be continuous, unbounded, and measured on an interval or ratio scale. Categorical outcome variables clearly don’t fit this requirement, so it’s easy to see that an ordinary linear model is not appropriate. Neither do count variables. That’s less obvious, because they are measured on a ratio scale, so it’s easier to think of them as continuous, or close to it. But they’re neither continuous nor unbounded, and this really affects assumptions.
Continuous variables measure how much. Count variables measure how many. Count variables can’t be negative (0 is the lowest possible value), and they’re often skewed, sometimes so severely that 0 is by far the most common value. And they’re discrete, not continuous. All those jokes about the average family having 1.3 children have a ring of truth in this context.
Count variables often follow a Poisson distribution or one of its relatives. The Poisson distribution assumes that each count is the result of the same Poisson process: a random process in which the counted events occur independently of one another and at a constant rate. If this count variable is used as the outcome of a regression model, we can use Poisson regression to estimate how predictors affect the number of times the event occurred.
But the Poisson model has very strict assumptions. One that is often violated is that the mean equals the variance. When the variance is too large because there are many 0s as well as a few very high values, the negative binomial model is an extension that can handle the extra variance.
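A quick, informal way to see whether the mean-equals-variance assumption is in trouble is to compare the sample variance with the sample mean. The sketch below uses only the Python standard library; the function name `dispersion_ratio` and the counts are made up for illustration. It also looks at the raw outcome only, whereas the assumption in a regression is about the variance given the predictors, so treat this as a first look rather than a formal test.

```python
from statistics import mean, variance

def dispersion_ratio(counts):
    """Sample variance divided by sample mean.

    A Poisson distribution implies a ratio near 1; values well above 1
    suggest overdispersion (more variance than a Poisson allows).
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("mean is zero, so the ratio is undefined")
    return variance(counts) / m

# Hypothetical counts: many 0s as well as a few very high values
counts = [0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 12, 20]

ratio = dispersion_ratio(counts)
print(f"mean = {mean(counts):.2f}, variance = {variance(counts):.2f}, ratio = {ratio:.2f}")
```

A ratio far above 1, as with these made-up counts, is the pattern that usually points toward a negative binomial model.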
But sometimes it’s just a matter of having more zeros than a Poisson would predict. In this case, a better solution is often the Zero-Inflated Poisson (ZIP) model. (And when there is extra variation too, its close relative, the Zero-Inflated Negative Binomial model, can handle both.)
ZIP models assume that some zeros occurred by a Poisson process, but others were not even eligible to have the event occur. So there are two processes at work—one that determines if the individual is even eligible for a non-zero response, and the other that determines the count of that response for eligible individuals.
The tricky part is either process can result in a 0 count. Since you can’t tell which 0s were eligible for a non-zero count, you can’t tell which zeros were results of which process. The ZIP model fits, simultaneously, two separate regression models. One is a logistic or probit model that models the probability of being eligible for a non-zero count. The other models the size of that count.
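To make the two processes concrete, here is a small simulation sketch in standard-library Python. It does not fit a ZIP model; it only generates data the way a ZIP model assumes the data arise. The function names, the 40% ineligibility rate, and the Poisson mean of 2 are all hypothetical choices for illustration.

```python
import math
import random

def poisson_draw(rng, lam):
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def zip_draw(rng, p_ineligible, lam):
    """One ZIP draw: ineligible cases are always 0, eligible cases get a Poisson count."""
    if rng.random() < p_ineligible:
        return 0                      # zero from the eligibility process
    return poisson_draw(rng, lam)     # may also be 0, by chance

def zip_zero_probability(p_ineligible, lam):
    """P(Y = 0) under ZIP: ineligible zeros plus Poisson zeros among the eligible."""
    return p_ineligible + (1 - p_ineligible) * math.exp(-lam)

rng = random.Random(42)               # fixed seed so the run is repeatable
p_ineligible, lam = 0.4, 2.0          # hypothetical values

sample = [zip_draw(rng, p_ineligible, lam) for _ in range(20000)]
observed_zero_share = sample.count(0) / len(sample)
sample_mean = sum(sample) / len(sample)
# Share of zeros a plain Poisson with the same mean would predict
poisson_zero_share = math.exp(-sample_mean)

print(f"observed share of zeros:       {observed_zero_share:.3f}")
print(f"ZIP-implied share of zeros:    {zip_zero_probability(p_ineligible, lam):.3f}")
print(f"plain-Poisson share of zeros:  {poisson_zero_share:.3f}")
```

In the simulated data, nothing marks a given 0 as coming from one process or the other. That is why the model has to estimate both parts at once.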
Both models can use the same predictor variables, but their coefficients are estimated separately. So the predictors can have vastly different effects on the two processes.
But a ZIP model requires it be theoretically plausible that some individuals are ineligible for a count. For example, consider a count of the number of disciplinary incidents in a day in a youth detention center. True, there may be some youth who would never instigate an incident, but the unit of observation in this case is the center. It is hard to imagine a situation in which a detention center would have no possibility of any incidents, even if they didn’t occur on some days.
Compare that to the number of alcoholic drinks consumed in a day, which could plausibly be fit with a ZIP model. Some participants do drink alcohol, but will have consumed 0 that day, by chance. But others just do not drink alcohol, so will never have a non-zero response. The ZIP model can determine which predictors affect the probability of being an alcohol consumer and which predictors affect how many drinks the consumers consume. They may not be the same predictors for the two models, or they could even have opposite effects on the two processes.
Here’s a little tip.
When you construct dummy variables, make it easy on yourself to remember which code is which. Heck, if you want to be really nice, make it easy for anyone else who will analyze the data or read the results.
Make the codes inherent in the dummy variable name.
So instead of a variable named Gender with values of 1=Female and 0=Male, call the variable Female.
Instead of a set of dummy variables named MaritalStatus1 with values of 1=Married and 0=Single, along with MaritalStatus2 with values 1=Divorced and 0=Single, name the same variables Married and Divorced.
And if you’re new to dummy coding, this has the extra bonus of making the dummy coding intuitive. It’s just a set of yes/no variables about all but one of your categories.
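As a rough illustration of the naming tip, here is a short standard-library Python sketch that builds one 0/1 variable per non-reference category and names each after the category it flags. The helper `make_dummies` and the sample data are hypothetical; most statistical packages have their own way of doing this.

```python
def make_dummies(values, reference):
    """Return {category_name: [0/1, ...]} for every category except the reference."""
    categories = sorted(set(values) - {reference})
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

# Hypothetical data, with Single as the reference category
marital_status = ["Single", "Married", "Divorced", "Married", "Single"]

dummies = make_dummies(marital_status, reference="Single")
print(dummies)
# {'Divorced': [0, 0, 1, 0, 0], 'Married': [0, 1, 0, 1, 0]}
```

Each variable answers a yes/no question about its own name, and the reference category is the one where every dummy is 0.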
Someone who registered for my upcoming Interpreting (Even Tricky) Regression Models workshop asked if the content applies to logistic regression as well.
The short answer: Yes.
The long-winded detailed explanation of why this is true and the one caveat:
One of the greatest things about regression models is that they all have the same set up: (more…)