Last month I did a webinar on Poisson and negative binomial models for count data. With a few hundred participants, we ran out of time to get through all the questions, so I’m answering some of them here on the blog.
These questions all relate to when it’s appropriate to treat count data as continuous and run the more familiar and simpler linear model.
Q: Do you have any guidelines or rules of thumb as far as how many discrete values an outcome variable can take on before it makes more sense to just treat it as continuous?
The issue usually isn’t a matter of how many values there are. I see what you mean in that a discrete scale that goes from 0 to 8 for example feels more discrete because there are only nine possible values, compared to a discrete scale that goes from 0 to 200. 201 values just feels more continuous.
But that’s not really the issue in most count models. The issue with count variables is that they are bounded at zero. This wreaks havoc on the assumptions of a linear model, which assume a continuous, unbounded outcome.
If none of your data are near zero, it would be less of an issue. Treating that count variable as continuous would give you predicted values that are non-integers, but perhaps that’s not a big issue in your particular data set.
Q: How high does the count scale have to be before you can consider it continuous?
I suspect you’re getting at the same issue as in the last question. It’s certainly true that when you get into very large numbers, many of the issues with count variables aren’t issues anymore.
For example, most incomes are not measured using decimals, just whole numbers. You could consider them a count of the number of dollars. Likewise, demographic variables like the number of children vaccinated in a state over the course of the year are truly counts, but the smallest values are likely to be in the hundreds of thousands, or even millions.
As long as there are no data along the bound of zero, and you don’t mind predicted values that include decimals, there’s no problem treating it as continuous.
Q: For count data that are not skewed, but distributed in a symmetric or even normal shape, are Poisson and NB still the best choice?
Sometimes. The Poisson distribution is only strongly skewed when the mean is very small. By the time the mean gets up to only 10, the distribution is roughly symmetric and bell shaped.
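Here is a rough sketch in Python of how quickly that skew fades. It computes the skewness of a Poisson distribution directly from its probabilities and compares it to the known formula, 1 divided by the square root of the mean. The function names and the particular means are just my choices for illustration.

```python
import math

def poisson_pmf(k, mean):
    # Work on the log scale so large counts don't overflow.
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))

def poisson_skewness(mean):
    # Sum far enough into the right tail that the leftover probability is negligible.
    upper = int(mean + 20 * math.sqrt(mean) + 50)
    ks = range(upper + 1)
    probs = [poisson_pmf(k, mean) for k in ks]
    mu = sum(k * p for k, p in zip(ks, probs))
    var = sum((k - mu) ** 2 * p for k, p in zip(ks, probs))
    third = sum((k - mu) ** 3 * p for k, p in zip(ks, probs))
    return third / var ** 1.5

for mean in (0.5, 2, 10, 50):
    print(f"mean={mean}: skewness={poisson_skewness(mean):.3f}, "
          f"1/sqrt(mean)={1 / math.sqrt(mean):.3f}")
```

The skewness never reaches exactly zero, but it shrinks steadily as the mean grows, which is why the distribution looks more and more bell shaped.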
Depending on the effects of the predictors and the actual range of the data, i.e. whether there are actual zero counts or not, you may get identical results from running a linear model compared to a Poisson or negative binomial model.
If you do run a linear model, it will be possible to get predicted values below zero, and you need to consider whether that’s problematic in your situation. If the point of your model is prediction, it may be more of an issue.
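To see how a straight line can dip below zero, here is a minimal sketch in Python. The data are made up for illustration, and fit_line is just a hypothetical helper that does ordinary least squares with one predictor.

```python
# Hypothetical data: counts (y) that fall toward zero as a predictor (x) increases.
x = list(range(10))
y = [12, 8, 5, 4, 2, 1, 1, 0, 0, 0]

def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

intercept, slope = fit_line(x, y)
predictions = [intercept + slope * xi for xi in x]

# Flag any fitted value that is an impossible (negative) count.
for xi, yi, pred in zip(x, y, predictions):
    flag = "  <- below zero" if pred < 0 else ""
    print(f"x={xi}  observed={yi}  predicted={pred:.2f}{flag}")
```

With counts piled up against zero like this, the fitted line has nowhere to go but through the bound at the high end of the predictor.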
Q: If count data can be normalized by log transformation, would you recommend using Poisson or linear regression?
It’s never wrong to run a Poisson model, so what you’re asking is if the increased accuracy is worth the trouble of running the more complicated model. There are certainly cases where running a linear model simplifies things a lot and still gives you the same results. (You just won’t know you have the same results unless you run both).
When the mean count is very small, and zero is the most common value in the data set, it will be impossible to normalize using a log transformation. It just won’t work. The mode will always be at the lowest value. In that situation, you have no choice.
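A small sketch in Python makes the point. Since the log of zero is undefined, I'm assuming the common workaround of taking log(count + 1); that choice, the function names, and the means of 0.5 and 10 are my assumptions for illustration. Because the transformation keeps the values in the same order, the most likely count is still the most likely transformed value.

```python
import math

def poisson_pmf(k, mean):
    # Work on the log scale so large counts don't overflow.
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))

def transformed_distribution(mean, max_count=30):
    # Map each transformed value log(count + 1) to the probability of that count.
    return {math.log1p(k): poisson_pmf(k, mean) for k in range(max_count + 1)}

def mode_of(dist):
    # The transformed value with the highest probability.
    return max(dist, key=dist.get)

low = transformed_distribution(0.5)   # zero is the most common count
high = transformed_distribution(10)   # zero is rare

print("mode after transforming, mean 0.5:", mode_of(low))
print("mode after transforming, mean 10: ", mode_of(high))
```

With the small mean, the mode stays stuck at the lowest transformed value, so no amount of transforming will produce a bell shape.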
However, if the mean count is a little bit larger, zero may not be the most common value. When the mode is not the lower bound, it will be possible to use a log transformation to normalize the data. In fact, not too many years ago, when Poisson and negative binomial models were not readily available in software, textbooks did suggest this approach. You may still have some on your shelf that do so.
It’s not necessarily a bad approach. You may very well get the exact same results. If so, and if, for example, you are writing a report for an audience without the statistical sophistication to understand the Poisson model, it may be a better choice.
However, it’s not exactly the same thing. A Poisson model uses a log link function, which applies the log to the mean, not to each individual data point. That makes coefficients from a log-transformed linear model harder to back transform, because they describe the mean of the logged counts rather than the log of the mean count. So if there are not major advantages to running a linear model, you are usually better off with a more sophisticated and more accurate Poisson model, or one of its derivatives.
If you’d like to learn more about the different models available for count data, you can download a recording of the webinar: Poisson and Negative Binomial Regression for Count Data. It’s free.
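The difference between logging the mean and logging each data point is easy to see with a few numbers. This is only a sketch in Python with made-up counts (no zeros, so every log is defined), not output from a fitted model.

```python
import math

# Hypothetical counts with no zeros, so the log is defined for every value.
counts = [2, 4, 8, 16, 20]

mean_count = sum(counts) / len(counts)

# A log link works with the log of the mean count.
log_of_mean = math.log(mean_count)

# A log transformation works with the mean of the logged counts.
mean_of_logs = sum(math.log(c) for c in counts) / len(counts)

print(f"mean count:        {mean_count:.2f}")
print(f"exp(log of mean):  {math.exp(log_of_mean):.2f}")
print(f"exp(mean of logs): {math.exp(mean_of_logs):.2f}")
```

Back transforming the log of the mean returns the mean count itself. Back transforming the mean of the logs returns the geometric mean, which is smaller than the mean count whenever the counts vary.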





Dear Karen,
Thank you for this excellent blog post. I look forward to listening to the webinar (I couldn’t tune in in real time). I was wondering what to do if you have count data that has a “large” mean (~10-12) but still many zeros. It could be considered zero inflated if there are “too many” zeros for a Poisson distribution with a mean of 10, or it could be considered overdispersed (too many high values for the rest of the distribution). I have heard that some people don’t like NB regressions and that quasi-Poisson has fallen out of favor for mixed models (even removed from the lme4 package in R). If a log-transformed regression resulted in reasonable residual plots, is that a case when it would be better than trying to deal with a GLMM?
Thanks,
Dan
Hi Dan,
It might, although if you have a lot of zeros, a log transform won’t result in normal residual plots, the way it would with a high mean and not a lot of zeros.
I’m not sure where the mixed model is coming into it–I’m guessing you have a design with clustered count data.
It’s always difficult to advise on what’s an appropriate analysis without all the details–for example, if you’re submitting to a high level journal, you ought to use the more sophisticated technique, even if you’re getting the exact same results and reasonable residuals from the linear model. If the answer is more important than how you get to it, then it would be fine. Unfortunately, you probably can’t tell if the linear mixed model is good enough without running the GLMM.
Karen