Generalized Linear Models in R, Part 7: Checking for Overdispersion in Count Regression

by guest

Share

by David Lillis, Ph.D.

In my last blog post we fitted a generalized linear model to count data using a Poisson error structure.

We found, however, that there was over-dispersion in the data – the variance was larger than the mean in our dependent variable.

Over-dispersion is a problem if the conditional variance (residual variance) is larger than the conditional mean.  One way to check for and deal with over-dispersion is to run a quasi-poisson model, which fits an extra dispersion parameter to account for that extra variance.

Now let’s fit a quasi-Poisson model to the same data.

model2 <- glm(Students ~ Days, quasipoisson)

summary(model2)
Call:
glm(formula = Students ~ Days, family = quasipoisson)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.00482  -0.85719  -0.09331   0.63969   1.73696  

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.990235   0.074789   26.61   <2e-16 ***
Days        -0.017463   0.001539  -11.35   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 0.7939441)

    Null deviance: 215.36  on 108  degrees of freedom
Residual deviance: 101.17  on 107  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

The outcome of our attempt to account for over-dispersion is that the residual deviance has not changed.

The dispersion parameter, which was forced to be 1 in our last model, is allowed to be estimated here. In fact, it is estimated at .79.

This parameter tells us how many times larger the variance is than the mean. Since our dispersion was less than one, it turns out the conditional variance is actually smaller than the conditional mean. We have under-dispersion, not over.

We can extract the model coefficients in the usual way:

model2$coefficients  
(Intercept)     Days 
 1.99023497 -0.01746317

Anyway – we now plot the regression. We set up a time axis running from 0 to 150 (the number of days). However, we include small increments of 0.1 in order to create a smooth appearance to our plot. We will evaluate the model on these values and then use those values to plot the model.

timeaxis <-seq 0="" 150="" 1="" pre="">

Anyway – we now plot the regression. We set up a time axis running from 0 to 150 (the number of days). However, we include small increments of 0.1 in order to create a smooth appearance to our plot. We will evaluate the model on these values and then use those values to plot the model.

timeaxis <-seq (0,150,0.1)

Now we use the predict() function to set up the fitted model values.

Y <- predict(model2, list(Days = timeaxis))

plot(Days, Number, xlab = "DAYS", ylab = "STUDENTS", pch = 16)

Finally, we plot the fitted model. We take the exponential of the fitted values because the fitted values are returned on a logarithmic scale. Taking the exponential back-transforms from the log scale to the original data.

lines(timeaxis, exp(Y), lwd = 2, col = "blue")

The graph shows a non-linear decrease in cases with number of days. Of course, instead of taking the exponential of the fitted values, we could also have used the predict() function together with the argument type = "response".

Z <- predict(model2, list(Days = timeaxis), type = "response")

plot(Days, Number, xlab = "DAYS", ylab = "NUMBER", pch = 16)

lines(timeaxis, Z, lwd = 2, col = "red") 

image001

Let’s calculate the impact on the number of cases arising from a one day increase along the time axis. First we take the exponential of the coefficients.

coeffs <- exp(coef(model2))

coeffs
(Intercept)        Days 
  7.3172529   0.9826884

We calculate the 95% confidence interval (upper and lower confidence limits) as follows:

CI <- exp(confint.default(model2))

CI
              2.5 %    97.5 %
(Intercept) 6.3195674 8.4724454
Days        0.9797296 0.9856562

We can calculate the change in number of students presenting with the disease for each additional day, as follows:

1 - 0.9826884
[1] 0.0173116

The reduction (rate ratio) is approximately 0.02 cases for each additional day.

****

See our full R Tutorial Series and other blog posts regarding R programming.

About the Author: David Lillis has taught R to many researchers and statisticians. His company, Sigma Statistics and Research Limited, provides both on-line instruction and face-to-face workshops on R, and coding services in R. David holds a doctorate in applied statistics.

Bookmark and Share

{ 4 comments… read them below or add one }

Fabio

great post! i just have want to underline that: the term quasipoisson in the formula of glm() is not a quasipoisson distribution. i here quote Zuur’s book pp.226(mixed model effects and their extensions in ecology)
quote: “specifiying the family option as quasipoisson instead of poisson gives the imporession that there is a quasi-Poisson distribution but there is no such thing! all we do here is specify the mean and variance relationchip and an exponential link between the expected values and explanatory variables. it is a software issue to call this ‘quasipoisson’. Do not write in your report or paper that you used a quasi-Poisson distribution.”
thus saying here that you used a quasipoisson is a mistake.

Reply

Karen

Hi Fabio, it wouldn’t be a mistake to say you ran a quasipoisson model, but you’re right, it is a mistake to say you ran a model with a quasipoisson distribution. The difference is subtle. As David points out the quasi poisson model runs a poisson model but adds a parameter to account for the overdispersion.

Karen

Reply

Sylvia

Thanks for this great post. When I use a quasi-poisson model to get the dispersion parameter for 8 different outcomes, I get values ranging from 1.24 – 2. What is a good “cutoff” for overdipsersion? Are all of these overdispersed since they are >1? Just trying to get a better sense of how to make this decision. Thanks!

Reply

Luke

Thanks for writing this helpful tutorial. I’m trying to recreate and am wondering where the “Number” variable come from in your first plot? Thanks!

Reply

Leave a Comment

Please note that, due to the large number of comments submitted, any comments on problems related to a personal study/project will not be answered. We suggest joining Statistically Speaking, where you have access to answers and more resources 24/7.

Previous post:

Next post: