Last month I did a webinar on Poisson and negative binomial models for count data. With a few hundred participants, we ran out of time to get through all the questions, so I’m answering some of them here on the blog.
This set of questions is all related to when it’s appropriate to treat count data as continuous and run the more familiar, simpler linear model.
Q: Do you have any guidelines or rules of thumb as far as how many discrete values an outcome variable can take on before it makes more sense to just treat it as continuous?
The issue usually isn’t a matter of how many values there are.
A scale from 0 to 8, for example, feels more discrete because there are only nine possible values, while a discrete scale that goes from 0 to 200 feels more continuous, with its 201 possible values.
The real issue with count variables is that they are bounded at zero. This wreaks havoc on the assumptions of a linear model, which require continuous, unbounded data.
If none of your data are near zero, it would be less of an issue. Treating that count variable as continuous would give you predicted values that are non-integers, but perhaps that’s not a big issue in your particular data set.
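To see the zero-bound problem concretely, here is a minimal sketch (with made-up counts, not data from the post) that fits a simple least-squares line to low-mean count data by hand. The fitted line happily predicts an impossible negative count just past the observed range:

```python
# Made-up count data piled up near the zero bound
x = [0, 1, 2, 3, 4, 5]
y = [3, 2, 1, 1, 0, 0]

# Simple least-squares fit by hand
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
intercept = ybar - slope * xbar

# Predict one step past the observed range
pred_at_6 = intercept + slope * 6
print(pred_at_6)  # below zero: an impossible count
```

Counts far from zero don’t run into this, which is why the bound, not the discreteness, is usually the deciding issue.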
Q: How high does the count scale have to be before you can consider it continuous?
I suspect you’re getting at the same issue as in the last question. It’s certainly true that when you get into very large numbers, many of the issues with count variables aren’t issues anymore.
For example, most incomes are not measured using decimals, just whole numbers. You could consider them a count of the number of dollars. Likewise, demographic variables like the number of children vaccinated in a state over the course of the year are truly counts, but the smallest values are likely to be in the hundreds of thousands, or even millions.
As long as there are no data near the zero bound, and you don’t mind predicted values that include decimals, there’s no problem treating it as continuous.
Q: For count data that are not skewed but have a symmetric, or even normal, shape, are Poisson and NB still the best choice?
Sometimes. The Poisson distribution is only skewed when the mean is very small; once the mean gets up to around 10, the distribution becomes symmetric and bell-shaped.
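You can check this from the distribution itself: the skewness of a Poisson is 1/sqrt(mean), so the skew fades quickly as the mean grows. A quick sketch:

```python
import math

def poisson_skewness(lam):
    """Theoretical skewness of a Poisson distribution with mean lam: 1/sqrt(lam)."""
    return 1.0 / math.sqrt(lam)

# Skewness drops fast as the mean grows
for lam in (0.5, 2, 10, 100):
    print(lam, round(poisson_skewness(lam), 3))
# At a mean of 10 the skewness is already only about 0.32
```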
Depending on the effects of the predictors and the actual range of the data, i.e. whether there are actual zero counts or not, you may get identical results from running a linear model compared to a Poisson or negative binomial model.
If you do run a linear model, it will be possible to get predicted values below zero, and you need to consider whether that’s problematic in your situation. If the point of your model is prediction, it may be more of an issue.
Q: If count data can be normalized by a log transformation, would you recommend Poisson or linear regression?
It’s never wrong to run a Poisson model, so what you’re asking is whether the increased accuracy is worth the trouble of running the more complicated model. There are certainly cases where running a linear model simplifies things a lot and still gives you the same results. (You just won’t know you have the same results unless you run both.)
When the mean count is very small and zero is the most common value in the data set, it will be impossible to normalize the data using a log transformation. It just won’t work; the mode will always be at the lowest value. In that situation, you have no choice.
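A quick illustration of why (the values here are made up): since log(0) is undefined you’d have to use log(y + 1), but the transformation is monotone, so the pile-up at zero stays exactly where it was:

```python
from collections import Counter
import math

# Hypothetical zero-heavy count sample (mean well below 1)
y = [0] * 12 + [1] * 5 + [2] * 2 + [3]

# log(0) is undefined, so shift before transforming
shifted = [math.log(v + 1) for v in y]

# The transformation is monotone, so the mode stays at the lowest value
mode_raw = Counter(y).most_common(1)[0][0]
mode_log = Counter(shifted).most_common(1)[0][0]
print(mode_raw, mode_log)  # 0 and 0.0: still an L-shaped pile-up at the bound
```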
However, if the mean count is a little bit larger, zero may not be the most common value. When the mode is not the lower bound, it will be possible to use a log transformation to normalize the data. In fact, not too many years ago, when Poisson and negative binomial models were not readily available in software, textbooks did suggest this approach. You may still have some on your shelf that do so.
It’s not necessarily a bad approach. You may very well get the exact same results. If so, and if, for example, you are writing a report for an audience without the statistical sophistication to understand the Poisson model, it may be the better choice.
However, it’s not exactly the same thing. A Poisson model uses a log link function, which applies the log to the mean, not to each individual data point. With a log transformation, by contrast, it’s harder to back-transform coefficients. So unless there are major advantages to running a linear model, you are usually better off with the more sophisticated and more accurate Poisson model, or one of its derivatives.
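The distinction is just Jensen’s inequality: the log of a mean is not the mean of the logs. A tiny sketch with made-up counts:

```python
import math

# Hypothetical small count sample (illustrative values only)
y = [1, 2, 3, 10]

mean_y = sum(y) / len(y)
log_of_mean = math.log(mean_y)                        # what a Poisson log link models
mean_of_logs = sum(math.log(v) for v in y) / len(y)   # what OLS on log(y) models

print(log_of_mean, mean_of_logs)  # they differ: log E[y] != E[log y]
```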
If you’d like to learn more about the different models available for count data, you can download a recording of the webinar, Poisson and Negative Binomial Regression for Count Data. It’s free.
Letoynia Coombs says
Is Poisson appropriate when your count is bounded on the upper end? For example a count that ranges from 0 to 4 with no higher count possible. Perhaps, I should just treat this as ordinal. Thanks.
I’m interested in this question as well. I have a count variable that ranges from 0 to 5, with no higher count possible. I’m struggling to decide whether to use a Poisson model or an ordinal model in my analysis.
I’d have to see it to really advise, but if they’re truly counts and range from 0 to 5, you probably want a Poisson.
Laura Bedford says
I am analysing the results of a pre and post randomised controlled trial and have 12 months of count data (6 months pre, 3 months intervention period and 3 months post intervention). My one IV is categorical (Experimental/Control). My other factor is time, in months. Some outcome count variables represent the number of events per individual in the trial (discrete and bounded at 0) and some represent the number of hours (continuous and bounded at 0). I would normally have opted for a mixed factorial ANOVA, but I can’t because of the Poisson distribution and the bounded nature of the outcome variables. I am not sure how to factor the time variable into a Poisson regression. Do you have any ideas?
My DV consists of 12 items with 6 sub-scales to be analysed individually and against each other; e.g., through a 2(gender) x 6 (question category) ANOVA.
Each of the 6 categories has a score which ranges from 0 to 2, with 2 being the highest score and 0 the lowest. Across the 6 categories most participants score 0, which makes the data highly skewed (L-shaped) when tested for normality. I have tried a log 10 transformation, but it didn’t work, and I am wondering what my best option is to get the data normally distributed enough to run an ANOVA.
I’ve found that when counts are high (high mean, many high observations), Poisson models can fail. I have even generated count data with a specific structure and run Poisson models on them, using both Limdep and Stata, and the models do not come up with the right coefficients: not even close, and sometimes not even the right sign. I talked to our local econometrics guru about this (William Greene, author of Limdep and of Econometric Analysis), who pointed out that Poisson really only works for counts of less than a few dozen. Above that, the function starts to collapse in most programs. Sure enough, if you put the formula for the probability mass function into Excel and then plug in some simulated count data with a high mean and many high counts (I think I used a mean of about 90, and my counts went as high as 900, consistent with some of the patent data I work with), the probability mass function generated all zeros and errors. I thus think it might be worth warning people about the limitations of the Poisson and negative binomial models.
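The Excel observation here is a floating-point issue rather than a flaw in the Poisson model itself: evaluating the pmf directly overflows or underflows at large counts, while the log of the pmf stays finite. A sketch, using the mean of 90 and count of 900 mentioned in the comment:

```python
import math

lam, k = 90.0, 900  # high mean and a very high count, as in the comment

# Naive pmf: exp(-lam) * lam**k / k!  -- lam**k alone overflows a double
try:
    naive = math.exp(-lam) * lam ** k / math.factorial(k)
except OverflowError:
    naive = None  # 90.0**900 exceeds the float range

# Working on the log scale is numerically stable
log_pmf = k * math.log(lam) - lam - math.lgamma(k + 1)
print(naive, log_pmf)  # None, and a finite (very negative) log-probability
```

Software that fails on data like this is evaluating the pmf directly; routines that maximize the log-likelihood on the log scale have no trouble with counts this large.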
Is it possible to use negative binomial regression on a count variable measured as a continuum between two poles? I have count data but am interested in understanding how my independent variable influences policy convergence, which is based on counts of words per year (does it lean more right or left?).
Could you please tell me if intraclass correlation coefficients can be used with count data (in my analysis, the count variable takes values ranging from 0 to 3 and my sample size is 60).
Thanks a lot for your help,
Chacha Nyangi says
Hi Karen, I am trying to analyse my data using SAS (GENMOD). All independent variables are categorical with different levels, while the 2 dependent variables are continuous (levels of aflatoxins and fumonisins in maize samples). Can you please advise if GENMOD is appropriate and how I can transform the data to fit a linear model?
I am still learning SAS, but I have to use it for my MSc thesis.
Hayley McBain says
Great website and webinars. I need to conduct a mediation analysis, but my outcome (healthcare utilisation) is count data. Neither Mplus nor SPSS can calculate the indirect effect or bootstrap the SE and CIs. Do you know of any other programs which are able to deal with this? My other option is to treat the outcome as continuous; as I’m not interested in the predicted values, this may be my only solution. Unless you know of another?
Thank you for your help.
I don’t know of one. I would check Andrew Hayes’ macros. http://www.afhayes.com/spss-sas-and-mplus-macros-and-code.html
Very useful blog. I am trying to perform a regression using Poisson and then using NB (due to overdispersion): y ~ x, where both are very high count data (they’re fish catches).
y: min: 136,000, max: 1,219,000, mean: 515,900
x: min: 24,000, max: 607,500, mean: 197,600
The data aren’t bounded at zero, so I could model these as normal (there’s autocorrelation, so using a linear model could be a way to deal with this).
My question is: are there any references, papers, journals or articles that actually say you can treat the count variables as continuous, and why? Is there some sort of theoretical motivation for this, perhaps due to the factorial in the Poisson distribution?
Usually when counts are that high, the distribution is indistinguishable from a normal. Although theoretically the values are discrete, the lack of continuity isn’t really noticeable at those values.
I know there are books that say the Poisson distribution approximates a normal. I would look for something that says that. Any theoretical stats book would have that.
I am currently working on a personal improvement program. I have 3 separate variables that I have collected data on: steps taken throughout the day (the target is at least 8,000 steps), exercise (yes/no), and amount of sleep each night. Amount of sleep each night is a continuous variable, while exercise is a discrete variable. Here are my questions:
Is steps taken throughout the day a continuous variable?
Do I create one control chart combining all three variables, or do I create 3 separate control charts?
Also, If I combine them into one chart, what variables would I put where in control charts in SPSS?
Thank you sooooo much for your help!!!
Hmm, sounds like a homework question….
How great to find this website! I have count data (number of times a firm was sued) which has tons of zeros, so I was considering running a negative binomial regression. I would normally scale the counts (e.g. number of lawsuits/number of divisions in the company, which gets me the average number of lawsuits per division). Can I even use the latter variable? There would still be a ton of zeros, so I’m thinking that a negative binomial regression is still the way to go. Just wondering about the ‘legality’ of scaling the count variable first. Is it also ok to not scale it? How would I make that decision?
Thank you in advance for your help!!
You can still scale it, but not directly. You need to use an exposure variable. See this article: https://www.theanalysisfactor.com/the-exposure-variable-in-poission-regression-models/
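A sketch of what the exposure variable does (the coefficients here are made up for illustration): the model puts log(exposure) on the right-hand side with a fixed coefficient of 1, so expected counts scale proportionally with size while the coefficients describe the rate:

```python
import math

# Hypothetical Poisson-regression coefficients (made up for this sketch):
#   log E[lawsuits_i] = log(divisions_i) + b0 + b1 * x_i
b0, b1 = -1.2, 0.4

def expected_lawsuits(x, divisions):
    """Expected lawsuit count for a firm with predictor x and this many divisions."""
    return divisions * math.exp(b0 + b1 * x)

# Two firms with the same x: the one with half the divisions
# has half the expected count, i.e. the same rate per division
ratio = expected_lawsuits(1.0, 5) / expected_lawsuits(1.0, 10)
print(ratio)  # 0.5
```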
How can I relate the number of steps in 3 days with age and BMI? Can I use Poisson regression? Or should I change steps to active/non-active and do logistic regression? Thanks.
Either is theoretically possible. I’d have to run some models and graphs to decide on the best course of action. It’s also possible to run a linear model if number of steps is generally large enough. I’m guessing you’re measuring number of steps taken in a day in adults, which probably doesn’t have a lot of 0s (although it could in some subpopulations). The Poisson distribution approximates a normal when the mean is above 10 or so.
Bridget Ryan says
I don’t know if you are still monitoring this blog, but I want to run a negative binomial rather than a Poisson on a count outcome (# healthcare visits) because of overdispersion, but my data are clustered (patients within clinics). I can’t find software that will run a multilevel negative binomial. Stata does multilevel Poisson but not NB. Any ideas?
Yes. There is an xtnbreg command in Stata that fits a conditional fixed effects model for multilevel data, although you may have to download it as an add-on. http://works.bepress.com/joseph_hilbe/16/
There’s a whole chapter on it in Hilbe’s book, Negative Binomial Regression. It’s really good and I highly recommend it.
Thanks very much, Karen. I will read this book. I enjoy your website and workshops. Thanks very much, and happy holidays.
Please forgive me if the question is dumb.
I have survey data on the share of innovative sales in total turnover. As it is bounded between 0 and 100, I think I should treat it as a count variable. The data have many zeros (ca. 70%) and are over-dispersed (too many values of 90-100), so I decided to use a hurdle regression model with a logit at the first stage and a ZTNB at the second.
The standard procedure for analyzing data from this kind of survey is a type-2 Tobit model, but those researchers have data on the amount of turnover from innovative sales, not just the share. In my opinion the HRM is equivalent to the Tobit model without a continuous variable, but could you confirm or comment, please?
Not dumb at all. This is really complicated, actually.
I do not believe the hurdle model is equivalent to the Tobit model, though there are, I’m sure, connections between them.
It actually sounds like you don’t want a hurdle model, as percentage data aren’t counts. Percentage data can have values beyond the decimal point; count values can only be whole numbers. Unless I’m misunderstanding what you mean by share of innovative sales….
I am working in ecology on the responses of reptile species to fire and vegetation type. I was planning to use GLMMs with a Poisson distribution. The response variable will be a standardized capture rate (captures/1000 trap nights). Before transformation the data are clearly in the form of counts. After transformation the values take on decimal places, so they are no longer integers.
Do I still treat this as count data or do I need to change the analysis?
Thanks for the great question. You do need to leave it in terms of counts, but use the number of trap nights as an exposure variable. Here’s some more info.
thank you for your blog.
I have recently started to work with count data. To fit a model I usually use an NB model. However, I was wondering if you have any recommendations for clustering methods. A lot of articles first log-transform their data and then perform a hierarchical clustering. What do you think?
That’s a really great question. I’ve never tried to do a cluster analysis on count data. I suspect that approach would work. Most multivariate methods assume normality, but I don’t remember off the top of my head if cluster analysis does. When the mean of a count variable gets large enough (around 10 or so), it starts to approximate a normal distribution anyway. So you may want to check the shape of the distribution before transforming.
Stan Alekman says
Re count data: there is a recommendation that count data that average 5 or more can be trended on a Shewhart XmR chart. I could possibly find the reference when I am home next. Can you comment?
I’m not familiar with Shewhart XmR charts. I just looked it up and I see they’re used in manufacturing. If you’d like to send the reference, I can try to comment.
Or if anyone else can comment on this, please feel free…
Daniel Hocking says
Thank you for this excellent blog post. I look forward to listening to the webinar (I couldn’t tune in in real time). I was wondering what to do if you have count data with a “large” mean (~10-12) but still many zeros. It could be considered zero-inflated if there are “too many” zeros for a Poisson distribution with a mean of 10, or it could be considered over-dispersed (too many high values for the rest of the distribution). I have heard that some people don’t like NB regressions, and quasi-Poisson has fallen out of favor for mixed models (it was even removed from the lme4 package in R). If a log-transformed regression resulted in reasonable residual plots, is that a case when it would be better than trying to deal with a GLMM?
It might, although if you have a lot of zeros, a log transform won’t result in normal residual plots, the way it would with a high mean and not a lot of zeros.
I’m not sure where the mixed model is coming into it; I’m guessing you have a design with clustered count data.
It’s always difficult to advise on what’s an appropriate analysis without all the details. For example, if you’re submitting to a high-level journal, you ought to use the more sophisticated technique, even if you’re getting the exact same results and reasonable residuals from the linear model. If the answer is more important than how you get to it, then it would be fine. Unfortunately, you probably can’t tell if the linear mixed model is good enough without running the GLMM. 🙂