The Analysis Factor

Statistical Consulting, Resources, and Statistics Workshops for Researchers

Generalized Linear Models in R, Part 7: Checking for Overdispersion in Count Regression

by David Lillis, Ph.D.

In my last blog post we fitted a generalized linear model to count data using a Poisson error structure.

We found, however, that there was over-dispersion in the data – the variance was larger than the mean in our dependent variable.

Over-dispersion occurs when the conditional variance (residual variance) is larger than the conditional mean. It is a problem because a Poisson model assumes the two are equal. One way to check for and deal with over-dispersion is to run a quasi-Poisson model, which fits an extra dispersion parameter to account for that extra variance.

Now let’s fit a quasi-Poisson model to the same data.

model2 <- glm(Students ~ Days, quasipoisson)
summary(model2)

Call:
glm(formula = Students ~ Days, family = quasipoisson)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.00482  -0.85719  -0.09331   0.63969   1.73696

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.990235   0.074789   26.61   <2e-16 ***
Days        -0.017463   0.001539  -11.35   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 0.7939441)

    Null deviance: 215.36  on 108  degrees of freedom
Residual deviance: 101.17  on 107  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

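The outcome of our attempt to account for over-dispersion is that the residual deviance has not changed. The coefficient estimates are also the same as in the Poisson model; only the standard errors, and the tests based on them, are rescaled by the estimated dispersion.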

The dispersion parameter, which was forced to be 1 in our last model, is estimated from the data here. In fact, it is estimated at 0.79.

This parameter tells us how many times larger the variance is than the mean. Since our dispersion was less than one, it turns out the conditional variance is actually smaller than the conditional mean. We have under-dispersion, not over.
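
As a rough illustration of where the dispersion estimate comes from, the sketch below recomputes it by hand as the Pearson chi-square statistic divided by the residual degrees of freedom. The data frame sim and its values are simulated and purely hypothetical; they are not the data used in this post, so the estimate will not be 0.79.

```r
# Hypothetical data: simulated counts, NOT the data set used in this post.
set.seed(42)
sim <- data.frame(Days = seq(1, 150, length.out = 109))
sim$Students <- rpois(nrow(sim), lambda = exp(2 - 0.017 * sim$Days))

# Quasi-Poisson fit, as in the post
fit_q <- glm(Students ~ Days, family = quasipoisson, data = sim)

# Pearson chi-square statistic divided by the residual degrees of freedom
pearson_chisq <- sum(residuals(fit_q, type = "pearson")^2)
dispersion_by_hand <- pearson_chisq / df.residual(fit_q)

# The same quantity as reported in the summary() output
dispersion_reported <- summary(fit_q)$dispersion

print(c(by_hand = dispersion_by_hand, reported = dispersion_reported))
```

A value near 1 is what a Poisson model expects; values clearly above 1 suggest over-dispersion and values clearly below 1 suggest under-dispersion.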

We can extract the model coefficients in the usual way:

model2$coefficients  
(Intercept)     Days 
 1.99023497 -0.01746317
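
These coefficients should match those of the ordinary Poisson fit, because the quasi-Poisson model changes only the dispersion. The sketch below checks this on simulated, hypothetical data (the object names sim, fit_p and fit_q are assumptions for illustration, not objects from this post): the coefficients and residual deviance agree, and each standard error is multiplied by the square root of the dispersion estimate.

```r
# Hypothetical data: simulated counts, NOT the data set used in this post.
set.seed(42)
sim <- data.frame(Days = seq(1, 150, length.out = 109))
sim$Students <- rpois(nrow(sim), lambda = exp(2 - 0.017 * sim$Days))

fit_p <- glm(Students ~ Days, family = poisson, data = sim)
fit_q <- glm(Students ~ Days, family = quasipoisson, data = sim)

# Same point estimates and same residual deviance
same_coef <- all.equal(coef(fit_p), coef(fit_q))
same_dev  <- all.equal(deviance(fit_p), deviance(fit_q))

# Standard errors differ by a factor of sqrt(dispersion)
se_p <- sqrt(diag(vcov(fit_p)))
se_q <- sqrt(diag(vcov(fit_q)))
phi  <- summary(fit_q)$dispersion
se_ratio <- se_q / se_p

print(same_coef)
print(same_dev)
print(rbind(se_ratio = se_ratio, sqrt_phi = sqrt(phi)))
```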

Anyway – we now plot the regression. We set up a time axis running from 0 to 150 (the number of days). However, we include small increments of 0.1 in order to create a smooth appearance to our plot. We will evaluate the model on these values and then use those values to plot the model.

timeaxis <- seq(0, 150, 0.1)

Now we use the predict() function to set up the fitted model values.

Y <- predict(model2, list(Days = timeaxis))

plot(Days, Students, xlab = "DAYS", ylab = "STUDENTS", pch = 16)

Finally, we plot the fitted model. We take the exponential of the fitted values because predict() returns them on the logarithmic (link) scale by default. Taking the exponential back-transforms from the log scale to the original scale of the data.

lines(timeaxis, exp(Y), lwd = 2, col = "blue")

The graph shows a non-linear decrease in cases with number of days. Of course, instead of taking the exponential of the fitted values, we could also have used the predict() function together with the argument type = "response".

Z <- predict(model2, list(Days = timeaxis), type = "response")

plot(Days, Students, xlab = "DAYS", ylab = "STUDENTS", pch = 16)

lines(timeaxis, Z, lwd = 2, col = "red")
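
The two approaches give the same curve. A minimal check, again on simulated and purely hypothetical data (sim and fit_q are illustrative names, not objects from this post), compares the back-transformed link-scale predictions with the response-scale predictions:

```r
# Hypothetical data: simulated counts, NOT the data set used in this post.
set.seed(42)
sim <- data.frame(Days = seq(1, 150, length.out = 109))
sim$Students <- rpois(nrow(sim), lambda = exp(2 - 0.017 * sim$Days))

fit_q <- glm(Students ~ Days, family = quasipoisson, data = sim)

# Fine grid of days for a smooth curve
timeaxis <- seq(0, 150, 0.1)
new_days <- data.frame(Days = timeaxis)

# Default predictions are on the log (link) scale
on_link <- predict(fit_q, newdata = new_days)

# type = "response" applies the inverse link for us
on_response <- predict(fit_q, newdata = new_days, type = "response")

# Largest absolute difference between the two versions of the curve
max_diff <- max(abs(exp(on_link) - on_response))
print(max_diff)
```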

[Figure: scatterplot of the number of students against days, with the fitted model curve]

Let’s calculate the impact on the number of cases arising from a one-day increase along the time axis. First we take the exponential of the coefficients.

coeffs <- exp(coef(model2))

coeffs
(Intercept)        Days 
  7.3172529   0.9826884

We calculate the 95% confidence interval (upper and lower confidence limits) as follows:

CI <- exp(confint.default(model2))

CI
              2.5 %    97.5 %
(Intercept) 6.3195674 8.4724454
Days        0.9797296 0.9856562
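
confint.default() returns Wald intervals: the estimate plus or minus a normal quantile times the standard error. As a sketch, the limits above can be reproduced by hand from the estimates and standard errors printed in the summary output. The numbers are typed in from that output, so the result agrees only up to rounding.

```r
# Estimates and standard errors copied from the summary() output above
est <- c("(Intercept)" = 1.990235, Days = -0.017463)
se  <- c("(Intercept)" = 0.074789, Days = 0.001539)

# Two-sided 95% normal quantile (about 1.96)
z <- qnorm(0.975)

# Wald limits on the log scale, then back-transformed with exp()
lower <- exp(est - z * se)
upper <- exp(est + z * se)

ci_by_hand <- cbind("2.5 %" = lower, "97.5 %" = upper)
print(ci_by_hand)
```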

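We can calculate the proportional change in the expected number of students presenting with the disease for each additional day, as follows:

1 - 0.9826884
[1] 0.0173116

The rate ratio is approximately 0.98, so the expected number of cases falls by about 1.7% for each additional day. This is a relative reduction, not a fixed number of cases per day.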
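
To make the interpretation concrete, the sketch below converts the Days coefficient into a rate ratio and a percentage change, and compares the expected counts at day 0 and day 1. The coefficients are copied from the output above; the variable names are illustrative only. Because the model is multiplicative, the effect of several days compounds rather than adds.

```r
# Coefficients copied from the model output above (log scale)
b_intercept <- 1.990235
b_days      <- -0.017463

# Rate ratio for one additional day, and the implied percentage change
rate_ratio <- exp(b_days)
pct_change <- (rate_ratio - 1) * 100   # roughly -1.7 percent per day

# Effects compound multiplicatively over several days
ten_day_ratio <- exp(10 * b_days)

# Expected counts at day 0 and day 1, and the absolute drop between them
expected_day0 <- exp(b_intercept)
expected_day1 <- exp(b_intercept + b_days)
abs_drop      <- expected_day0 - expected_day1

print(c(rate_ratio = rate_ratio, pct_change = pct_change,
        ten_day_ratio = ten_day_ratio, abs_drop = abs_drop))
```

The absolute drop depends on the starting level, which is why the effect is better reported as a rate ratio or a percentage than as a number of cases.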

See our full R Tutorial Series and other blog posts regarding R programming.

About the Author: David Lillis has taught R to many researchers and statisticians. His company, Sigma Statistics and Research Limited, provides both on-line instruction and face-to-face workshops on R, and coding services in R. David holds a doctorate in applied statistics.

Comments

  1. Gershon says

    January 10, 2020 at 4:30 am

    Thanks very much for the post. I would love to know how to use the Wald test to test for overdispersion in a Poisson and negative binomial regression model.
    Thank you in advance

  2. Caroline Rhomberg says

    June 17, 2019 at 9:18 am

    Hi,

    Just wanted to say thank you SO much for all these posts. They really helped me to understand GLM and their purpose….especially since I have a final tomorrow 🙂

    You really saved me.

  3. Doron says

    November 12, 2018 at 3:52 pm

    Hi, I am also playing with the Poisson and quasi-Poisson families in glm.
    I have found that the parameter fitting is identical using both families. It is only the dispersion parameter that changes.
    Can anyone explain this?

    • Karen Grace-Martin says

      November 27, 2018 at 3:46 pm

      That’s what quasi poisson is. It fits an extra parameter that allows the variance > mean. Poisson doesn’t.

  4. Fabio says

    April 1, 2018 at 12:46 pm

    Great post! I just want to underline that the term quasipoisson in the glm() call does not refer to a quasi-Poisson distribution. I quote here from Zuur's book, p. 226 (Mixed Effects Models and Extensions in Ecology with R):
    "Specifying the family option as quasipoisson instead of poisson gives the impression that there is a quasi-Poisson distribution but there is no such thing! All we do here is specify the mean and variance relationship and an exponential link between the expected values and explanatory variables. It is a software issue to call this 'quasipoisson'. Do not write in your report or paper that you used a quasi-Poisson distribution."
    Thus saying here that you used a quasi-Poisson distribution is a mistake.

    • Karen says

      April 9, 2018 at 4:06 pm

      Hi Fabio, it wouldn’t be a mistake to say you ran a quasipoisson model, but you’re right, it is a mistake to say you ran a model with a quasipoisson distribution. The difference is subtle. As David points out the quasi poisson model runs a poisson model but adds a parameter to account for the overdispersion.

      Karen

  5. Sylvia says

    April 7, 2017 at 2:28 pm

    Thanks for this great post. When I use a quasi-Poisson model to get the dispersion parameter for 8 different outcomes, I get values ranging from 1.24 to 2. What is a good "cutoff" for overdispersion? Are all of these overdispersed since they are >1? Just trying to get a better sense of how to make this decision. Thanks!

  6. Luke says

    February 10, 2016 at 2:11 pm

    Thanks for writing this helpful tutorial. I'm trying to recreate it and am wondering where the "Number" variable comes from in your first plot? Thanks!


Copyright © 2008–2021 The Analysis Factor, LLC. All rights reserved.