Last month I did a webinar on Poisson and negative binomial models for count data. With a few hundred participants, we ran out of time to get through all the questions, so I’m answering some of them here on the blog.
These questions all relate to when it’s appropriate to treat count data as continuous and run the more familiar and simpler linear model.
Q: Do you have any guidelines or rules of thumb as far as how many discrete values an outcome variable can take on before it makes more sense to just treat it as continuous?
The issue usually isn’t a matter of how many values there are. I see what you mean in that a discrete scale that goes from 0 to 8 for example feels more discrete because there are only nine possible values, compared to a discrete scale that goes from 0 to 200. 201 values just feels more continuous.
But that’s not really the issue in most count models. The issue with count variables is that they are bounded at zero. This wreaks havoc on the assumptions of a linear model, which treats the outcome as continuous and unbounded, with normally distributed errors of constant variance.
If none of your data are near zero, it would be less of an issue. Treating that count variable as continuous would give you predicted values that are non-integers, but perhaps that’s not a big issue in your particular data set.
Q: How high does the count scale have to be before you can consider it continuous?
I suspect you’re getting at the same issue as in the last question. It’s certainly true that when you get into very large numbers, many of the issues with count variables aren’t issues anymore.
For example, most incomes are not measured using decimals, just whole numbers. You could consider them a count of the number of dollars. Likewise, demographic variables like the number of children vaccinated in a state over the course of the year are truly counts, but the smallest values are likely to be in the hundreds of thousands, or even millions.
As long as there are no data along the bound of zero, and you don’t mind predicted values that include decimals, there’s no problem treating it as continuous.
Q: For count data that are not skewed, but distributed in a symmetric or even normal shape, are Poisson and NB still the best choice?
Sometimes. The Poisson distribution is only noticeably skewed when the mean is very small. Once the mean gets up to only about 10, the distribution is roughly symmetric and bell-shaped.
Depending on the effects of the predictors and the actual range of the data, i.e. whether there are actual zeros or not, you may get nearly identical results from running a linear model compared to a Poisson or negative binomial model.
If you do run a linear model, it will be possible to get predicted values below zero, and you need to consider whether that’s problematic in your situation. If the point of your model is prediction, it may be more of an issue.
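To see how quickly the skew fades, here is a minimal sketch that computes the skewness of a Poisson distribution directly from its probabilities. The means I picked (0.5, 2, 10, 50) and the function names are just for illustration, not anything from the webinar. In theory the skewness is 1 divided by the square root of the mean, so it never reaches exactly zero, it just gets small.

```python
import math

def poisson_pmf(k, mu):
    """P(Y = k) for a Poisson distribution with mean mu."""
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))

def poisson_skewness(mu, upper=200):
    """Skewness computed numerically from the pmf over 0..upper.

    The cutoff `upper` is an assumption; it is far enough into the tail
    for the illustrative means used below.
    """
    ks = range(upper + 1)
    probs = [poisson_pmf(k, mu) for k in ks]
    mean = sum(k * p for k, p in zip(ks, probs))
    var = sum((k - mean) ** 2 * p for k, p in zip(ks, probs))
    third = sum((k - mean) ** 3 * p for k, p in zip(ks, probs))
    return third / var ** 1.5

# Illustrative means only: skewness shrinks and P(Y = 0) vanishes as the mean grows.
for mu in (0.5, 2, 10, 50):
    print(f"mean={mu:>4}: skewness={poisson_skewness(mu):.3f}, "
          f"P(Y=0)={poisson_pmf(0, mu):.4f}")
```

The same loop also prints the probability of a zero, which is the other half of the story: as the mean grows, the data move away from the bound.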
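Here is a small sketch of how a straight line can end up predicting a negative count. The ten data points are made up purely for illustration (a count that declines toward zero as x increases), and the least squares fit is done by hand so nothing beyond plain Python is needed.

```python
# Made-up illustrative data: a count outcome that declines toward zero as x grows.
x = list(range(10))
y = [12, 9, 7, 5, 3, 2, 1, 0, 0, 0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Ordinary least squares for a single predictor, computed by hand.
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

def predict(xi):
    """Predicted count from the fitted straight line."""
    return intercept + slope * xi

# The line keeps going down after the observed counts have hit the floor at zero.
for xi in (0, 5, 8, 9):
    print(f"x={xi}: predicted count = {predict(xi):.2f}")
```

With these numbers the fitted line drops below zero at the high end of x, even though no observed count is negative. A model with a log link predicts the exponential of the linear predictor, which can never be negative.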
Q: If count data can be normalized by log transformation, would you recommend using Poisson or linear regression?
It’s never wrong to run a Poisson model, so what you’re asking is if the increased accuracy is worth the trouble of running the more complicated model. There are certainly cases where running a linear model simplifies things a lot and still gives you the same results. (You just won’t know you have the same results unless you run both.)
When the mean count is very small, and zero is the most common value in the data set, it will be impossible to normalize using a log transformation. It just won’t work. The mode will always be at the lowest value. In that situation, you have no choice but to run a count model.
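A quick sketch of why the transformation can’t help here. I’m assuming a Poisson distribution with a mean of 0.5 and a log(y + 1) transformation (the plus one is my assumption, needed because the log of zero is undefined). A log transformation changes the spacing between values but not how often each value occurs, so the pile at the bottom stays a pile at the bottom.

```python
import math

def poisson_pmf(k, mu):
    """P(Y = k) for a Poisson distribution with mean mu."""
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))

mu = 0.5  # hypothetical small mean count
counts = list(range(6))
probs = [poisson_pmf(k, mu) for k in counts]

# log(y + 1) is an assumed choice of transformation, since log(0) is undefined.
transformed = [math.log(k + 1) for k in counts]

# Each value keeps its probability after the transformation; only its position moves.
most_likely = probs.index(max(probs))
mode_raw = counts[most_likely]
mode_transformed = transformed[most_likely]

for k, t, p in zip(counts, transformed, probs):
    print(f"y={k}  log(y+1)={t:.3f}  probability={p:.4f}")
print("mode before:", mode_raw, " mode after:", mode_transformed)
```

Both before and after the transformation, the most common value is the smallest one, so the distribution cannot become bell shaped.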
However, if the mean count is a little bit larger, zero may not be the most common value. When the mode is not the lower bound, it will be possible to use a log transformation to normalize the data. In fact, not too many years ago, when Poisson and negative binomial models were not readily available in software, textbooks did suggest this approach. You may still have some on your shelf that do so.
It’s not necessarily a bad approach. You may very well get the exact same results. If so, and if, for example, you are writing a report for an audience without the statistical sophistication to understand the Poisson model, it may be a better choice.
However, it’s not exactly the same thing. A Poisson model uses a log link function, which applies the log to the mean, not to each individual data point. A linear model on log-transformed data instead models the mean of the logged values, so it’s harder to back-transform its coefficients to the original count scale. So if there are not major advantages to running a linear model, you are usually better off with a more sophisticated and more accurate Poisson model, or one of its derivatives.
If you’d like to learn more about the different models available for count data, you can download a recording of the webinar: Poisson and Negative Binomial Regression for Count Data. It’s free.
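The difference between logging the mean and logging each data point can be shown with a little arithmetic. This is a sketch under my own assumptions: a Poisson outcome with a mean of 10, and log(y + 1) as the transformation, since the log of zero is undefined. Neither choice comes from the webinar.

```python
import math

def poisson_pmf(k, mu):
    """P(Y = k) for a Poisson distribution with mean mu."""
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))

mu = 10.0  # hypothetical true mean count
ks = range(200)  # assumed cutoff, far into the tail for this mean
probs = [poisson_pmf(k, mu) for k in ks]

# What a log link works with: the log of the mean.
log_of_mean = math.log(sum(k * p for k, p in zip(ks, probs)))

# What a linear model on log(y + 1) works with: the mean of the logs.
mean_of_logs = sum(math.log(k + 1) * p for k, p in zip(ks, probs))

back_from_link = math.exp(log_of_mean)            # recovers the mean count
back_from_transform = math.exp(mean_of_logs) - 1  # undoes log(y + 1)

print(f"true mean:                         {mu:.3f}")
print(f"back-transformed from log link:    {back_from_link:.3f}")
print(f"back-transformed from log(y + 1):  {back_from_transform:.3f}")
```

Exponentiating the log of the mean gets you back to the mean. Exponentiating the mean of the logs does not: it lands below the true mean, which is why back-transformed results from a log-transformed linear model need more care to interpret.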