Poisson regression models and their extensions (zero-inflated Poisson, negative binomial regression, etc.) are used to model counts and rates. A few examples of count variables include:

– Number of words an eighteen month old can say

– Number of aggressive incidents committed by patients in an inpatient rehab center

Most count variables follow one of the distributions in the Poisson family. Poisson regression models allow researchers to examine the relationship between predictors and a count outcome variable.

Using these regression models gives much more accurate parameter estimates than trying to fit an ordinary linear regression model, whose assumptions, such as normally distributed residuals and constant variance, rarely hold for count data.

But how do the Poisson models handle rates? A rate is just a count per unit time.

The first example would not need a rate, but the second probably would. If all patients are in the center for the same number of days, a rate is unnecessary. But if the number of days each patient is present varies, that exposure itself could affect the count: ten incidents over 180 days is a much lower rate than ten incidents over 15 days.

Poisson models handle exposure variables by using simple algebra to change the dependent variable from a rate into a count.

If the rate is count/exposure, multiplying both sides of the equation by the exposure moves it to the right side. When both sides are then logged, the final model contains ln(exposure) as a term added to the linear predictor. This logged variable, ln(exposure), is called the **offset variable**.
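Written out, the algebra above is:

```latex
E\!\left(\frac{\text{count}}{\text{exposure}}\right) = e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k}
\;\Rightarrow\;
E(\text{count}) = \text{exposure} \cdot e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k}
\;\Rightarrow\;
\ln E(\text{count}) = \ln(\text{exposure}) + \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k
```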

Most statistical software requires you to create the logged variable yourself and define it as the offset. Stata is an exception: it lets you specify either the raw exposure variable or the logged offset.

One important feature of an offset variable is that its coefficient is constrained to 1. This is because it is part of the rate. The coefficient of 1 is what lets you, in theory, move it back to the left side of the equation and turn the count back into a rate.

What this means theoretically is that by defining an offset variable, you are only adjusting for the amount of opportunity an event has. The assumption here is that, for example, every day in rehab makes a patient equally likely to have an aggressive incident. Each day is simply an opportunity for an incident. A patient in for 20 days is expected to have twice as many incidents as a patient in for 10 days.

There is an assumption here that the likelihood of events does not change over time. If, for example, it takes patients a few weeks to learn the consequences of aggressive behavior, after which the behavior stops or lessens, then time is not just a matter of exposure. Likewise, if patients become more agitated after being in the program for a few months, so that longer residence actually creates more aggression, then time is not just a matter of exposure. In either case, number of days in the program would serve better as a predictor than as an exposure variable. As a predictor, its coefficient is estimated from the data, not set to 1.
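The difference between the two specifications can be sketched in a few lines of code. The coefficients and variable names below are hypothetical, purely for illustration: with an offset, the expected count is forced to scale one-for-one with days, while as a predictor, ln(days) gets a freely estimated coefficient.

```python
import math

# Hypothetical coefficients, for illustration only (not fit to real data).
b0, b_severity = -3.0, 0.8

def expected_count_offset(days, severity):
    # Offset model: ln(E[count]) = ln(days) + b0 + b_severity*severity
    # The implicit coefficient on ln(days) is fixed at 1.
    return math.exp(math.log(days) + b0 + b_severity * severity)

def expected_count_predictor(days, severity, b_days):
    # Predictor model: ln(E[count]) = b0 + b_days*ln(days) + b_severity*severity
    # b_days is estimated from the data and need not equal 1.
    return math.exp(b0 + b_days * math.log(days) + b_severity * severity)

# With the offset, doubling days exactly doubles the expected count:
r = expected_count_offset(20, 1.0) / expected_count_offset(10, 1.0)
print(r)  # 2.0

# With an estimated coefficient of, say, 0.5, doubling days multiplies
# the expected count by 2**0.5 instead:
r2 = expected_count_predictor(20, 1.0, 0.5) / expected_count_predictor(10, 1.0, 0.5)
print(r2)
```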

This logic extends to any regression model with a ratio as the dependent variable. Make sure you can justify the assumption that the denominator of that ratio affects the numerator only through opportunity.

{ 34 comments }

First of all…THANK YOU VERY MUCH for your magnificent work!

Next, it is my question:

Some data sets report a variable as a count, but also as a rate per 100,000 people (patients, residents, etc.). Because it is more reasonable to compare rates when there are large differences in population (for instance among countries, counties, or cities), which is the correct way to formulate the model when the dependent (response) variable is, for instance, "murders per 100,000 residents"?

I hope you can guide me!…

Thanks again and all the best!

Hi! Great post, thanks a lot!! Here is my question: I am studying the rate of specific behaviors displayed by judges observed in a sample of case hearings (the count variable goes from 0 to 3 behaviors). When running GOF using estat gof, or using glm, both the Pearson and deviance statistics are well below the threshold (yay, my models fit the data). However, when I add an exposure variable (duration of the hearings), the Pearson test and the deviance test give conflicting results (reject H0 for Pearson, fail to reject H0 for deviance). What should I do? Part of the problem may be that my sample size is small (380 hearings), so adding the exposure parameter may be creating instability for the Pearson test. Can I still move forward and calculate expected values for the count outcome, count distributions, etc.? Thanks in advance!!

Hi Karen,

I want to assess the association between EmergencyVisitDueToInfection and Temperature. My question is how I can control for age and gender in the model.

For example, Table 1 has one row per patient. I created Table 2, which contains the infection count per date, before feeding the Poisson model. However, I am having difficulty creating a gender or age variable in Table 2 because it has one row per date, and there are many patients on each date.

Table 1:

ID  Date        Age  Gender  Infection (1=Yes, 0=No)
1   2018-01-01  20   M       1
2   2018-01-01  30   F       0
3   2018-01-02  40   F       1
4   2018-01-02  50   F       1
…

Table 2:

Date        Count of Infection
2018-01-01  1
2018-01-02  2
…

Many thanks,

Helen

Hi Karen,

I’m unclear on how to use negative binomial regression for my situation. My dependent variable is vaccine exemptions (a count), but I also have rates. My independent variables are school type, geographic location, and free and reduced school lunch rates, and I’m trying to analyze the difference in exemption rates from 2014 to 2015. I’m not sure what the counts really convey when I’m more interested in the rates. Also, how do I get adjusted vs. unadjusted IRRs using SPSS?

Very nice information, thank you very much

but how do I interpret the results?

Hi Karen,

Great post, thank you for taking the time to put it up!

I am working on investigating trends in incidence rates over roughly 20 years. Besides joinpoint, do you have a recommendation on how to do this? For example, would you recommend using splines in Poisson regression?

Thank you for your help!

Hello,

I just have a small question: is prevalence a count variable, and can Poisson regression therefore be conducted with a set of independent variables (age, country, population size)? Could you please help with your knowledge?

Thanks in advance.

Hi Karen,

This was very useful, but I couldn’t help getting a bit confused. So, should I use an exposure variable only if I have a ratio as the dependent variable?

In my case, I would like to know how the incidence of a disease varies across countries. I have incidence as a count: total diagnoses over several years (the sum of yearly diagnoses). Of course, population affects total diagnoses. Should I use the population to transform incidence into a rate, or should I use population as an exposure variable, even though my dependent variable is already a count?

Thank you! All the best

Hi Joana,

You can’t use the rate as the DV–it has to be a count. So use the count as the DV and population as the exposure variable.

Hi Karen,

I found this very helpful, thank you !

I’m just picking up on Anne’s second question above.

Would it be possible to include a version of time both as an offset/exposure variable (to control for the pure time effect, more time = more incidents, as you described above) and as an IV (say, a dummy for people who stay longer than 10 days), in order to see how the rate of aggressive behaviour is affected by time outside the pure exposure mechanic?

Thanks,

Kalle

Thank you very much. Now I understand the offset in count models.

Hi Karen,

Here’s a hopefully quick question: what are the implications (perhaps anticipated types of bias) of using OLS rather than a Poisson model when the dependent variable is a count? Is there a rule of thumb for when it matters and when it doesn’t? Thanks!

Hi Karen,

Quick question. I am running a Poisson regression with an exposure variable. In Stata the syntax is pretty straightforward: poisson y x1, exposure(z), where y is my count variable, x1 is my independent variable, and z is my exposure variable. In this context, do I interpret the coefficient on x1 (i.e., beta1) as the effect of x1 on the count y, or as the effect of x1 on the rate y/z? Thanks, and sorry for the simple question!!

Hi Karen,

First of all, I would like to thank you for the great article here and hosting this conversation room!

I was wondering if you have seen any papers/text books, that discuss exposure in a setting similar to what I describe below:

Let’s assume we are interested in the number of kids in a class who develop a specific type of disorder. Our explanatory variables are the numbers of kids of certain ages, the numbers of female and male kids, and some other counts of kids in different cohorts:

E(Y|X1, X2, …, Xn) = f(X1, X2, …, Xn)

Y = # of kids with the disorder

X1 = # of kids younger than 5

X2 = # of kids older than 5

X3 = # of females

X4 = # of males

and so on.

Exposure = X = total number of kids

As you can see, all of the variables are counts, as well as the exposure.

Thank you for the great web site. I read the articles “The Exposure Variable in Poisson Regression Models” and “Poisson Regression Analysis for Count Data” and have a follow-up question.

It would be a big help if you could give a practical example (by hand) how Poisson regression is used to calculate a time trend line, and calculate a confidence interval for whether there is a trend. Here is some sample data if you would like (Texas viral hepatitis deaths):

Year  Events  Population  Rate per 100,000
------------------------------------------
1990     108    16986510   0.64
1991     154    17349000   0.89
1992     141    17655650   0.80
1993     212    18031484   1.18
1994     254    18378185   1.38
1995     283    18723991   1.51
1996     353    19128261   1.85
1997     383    19439337   1.97
1998     432    19759614   2.19
------------------------------------------

I understand the algorithm for the least-squares slope, and how to test that slope for significance. But I want a trend method that takes into account the variance of the data points. Obviously, if each data point is based on hundreds of events, the slope is more reliable than if each is based on just a few events. I have also seen chi-square suggested, know how to do it, and plan to give it a try, but I saw many more references to Poisson regression for time-trend analysis.

I am familiar with the Poisson distribution itself, so that’s not the problem. But I can’t find a practical example of Poisson regression anywhere. The only way I can understand an algorithm is to do it by hand.

Thanks,

Daniel
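For readers with the same question: the fit Daniel describes can be done by hand with Fisher scoring, the same iterative algorithm GLM software uses under the hood. A minimal pure-Python sketch (variable names are my own), fitting ln E(deaths) = ln(population) + b0 + b1*(year - 1994) to the Texas data above:

```python
import math

# Texas viral hepatitis deaths (from the comment above).
years  = [1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998]
deaths = [108, 154, 141, 212, 254, 283, 353, 383, 432]
pops   = [16986510, 17349000, 17655650, 18031484, 18378185,
          18723991, 19128261, 19439337, 19759614]

t = [y - 1994 for y in years]          # center year for numerical stability
offset = [math.log(p) for p in pops]   # ln(exposure), coefficient fixed at 1

# Fisher scoring for the model ln E(deaths) = offset + b0 + b1*t
b0 = math.log(sum(deaths) / sum(pops))  # start at the overall log rate
b1 = 0.0
for _ in range(25):
    mu = [math.exp(o + b0 + b1 * ti) for o, ti in zip(offset, t)]
    # Score vector U and 2x2 Fisher information I
    u0 = sum(y - m for y, m in zip(deaths, mu))
    u1 = sum((y - m) * ti for y, m, ti in zip(deaths, mu, t))
    i00 = sum(mu)
    i01 = sum(m * ti for m, ti in zip(mu, t))
    i11 = sum(m * ti * ti for m, ti in zip(mu, t))
    det = i00 * i11 - i01 * i01
    d0 = (i11 * u0 - i01 * u1) / det    # solve I * delta = U by hand
    d1 = (i00 * u1 - i01 * u0) / det
    b0, b1 = b0 + d0, b1 + d1
    if abs(d0) + abs(d1) < 1e-10:
        break

se_b1 = math.sqrt(i00 / det)            # sqrt of (I^-1)[1][1]
print(f"annual rate ratio: exp(b1) = {math.exp(b1):.3f}")
print(f"95% CI for b1: ({b1 - 1.96*se_b1:.4f}, {b1 + 1.96*se_b1:.4f})")
```

A confidence interval for b1 that excludes 0 indicates a significant trend, and exp(b1) is the estimated multiplicative change in the death rate per year. Because the Fisher weights are the expected counts, years with more events automatically carry more weight, which is exactly the variance-awareness Daniel asks for.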

Hi, I want to learn Stata. Can you help me? It's very necessary.

thanks

Hmmm, not very quickly. Unfortunately, we haven’t yet included Stata in our trainings. (We will at some point, though). I know that Michael Mitchell has an introductory book on using Stata. I would suggest starting there.

Hi Karen,

First, I would like to say that since finding this great website I’ve returned to it an infinite number of times, and have recommended it to an infinite number of people. If I could, I would spend all day reading the material and viewing the webinars, but unfortunately my thesis is not going to write itself…

I have a couple of what are probably silly questions, but I have found contradicting info on the matter, so just hoping you can settle this.

1) Is there a certain sample size below which Poisson or Negative Binomial regressions are not recommended/credible? Is this in terms of, say, study subjects (e.g., 26 individuals are not enough, but 78 are much better) or in terms of subjects x the observations of their count data (e.g., each of the 26 subjects has 10 observations, so 260 observations overall is ok)?

2) Is it true that when using exposure as an offset variable, it first has to be log-transformed? Log10 or natural log? If so, do current versions of SPSS do this automatically when defining an offset variable, or do we need to handle this simple transformation beforehand?

Also, I was wondering if there is any way to participate in workshops retrospectively since you write that videos are available for those. I missed out on registering for your recent workshop as the site said it was already full.

Thanks again for being so good at explaining this stuff. You are appreciated by many!

Grateful2U

Hi Grateful,

First, thanks for the kind words. These are great questions.

1. It’s a function of the number of subjects and the number of parameters in the model.

2. This depends on your software. Some want it one way and some want it the other. I would suggest checking your manual.

3. Yes. We don’t yet have past workshops available for sale in an organized way (though we’re working on it), but feel free to contact my team using the Contact form. We do make past workshops available on request.

Hi Karen,

Question: can an exposure variable (i.e., the variable I am scaling my DV by) ALSO be an independent variable? It seems from your responses above that it probably can’t be, but I’m not sure. In your example, clearly the longer a patient is admitted, the more opportunity there is for an incident, but as you also point out, the length of a patient’s stay may affect the likelihood in other ways beyond increased opportunity. Can you include the length of stay as an exposure variable to represent the increased opportunity, and then also include it as an independent variable to pick up the information that is not exposure related?

This website is incredibly helpful. I hope they pay you the big bucks.

Anne

Hi Anne,

The exposure variable is an independent variable in the model, but its coefficient is constrained to 1.

I don’t know how you could include it twice, but you could always include it as a predictor instead.

And I lol’d about the big bucks. Not sure who ‘they’ are. 🙂

Karen

Sorry that I don’t know who ‘they’ are either, or I would send them your way …

If a measurement W is in there twice (once as the exposure variable and once as an IV), then the logged value of W (as the exposure variable) could be swung to the left-hand side to scale the DV (because the exposure coefficient is 1). And if the coefficient on W as an IV were significant, you would interpret it just like any other coefficient in the model, no? As e^coef? The only issue I can come up with is that the two variables (log W and W) are likely highly correlated, but wouldn’t that work against finding an effect of W? Then again, with log W constrained to a coefficient of 1, would that preclude me from using it that way? If the coefficient on W is significant, then there is just a combined effect, no? I’m not sure what to think. I am wondering if you can see another angle that would make this mathematically problematic.

Why can’t you use an explanatory variable already included in the model as an exposure term? For example, I’m running a regression of crime on guns and want to control for the effect of population on the other covariates, but I also want to adjust for the amount of opportunity for crime by the population; thus the inclusion of population as both an independent variable and an exposure term.

Hi Giovanni,

When you put an exposure term in, you are already including it as a predictor. You’re just setting its coefficient to 1 instead of letting the model estimate it. This is what allows you to interpret your other IRRs in terms of rate per unit of exposure.

So if you include it as both predictor and exposure, you’ll have perfect multicollinearity between two Xs.

I am trying to understand the idea of perfect collinearity as a show-stopper. Wouldn't the same argument apply to polynomial terms?

Karen,

Thanks for the helpful article. When running generalized estimating equations, would you advise doing what you described above, or would it be acceptable to use the calculated rate as the DV and specify a Poisson distribution and a log link? Thanks!

Evan

Hi Evan. Most (I want to say all) stat software requires that the data be discrete if you specify a Poisson distribution. If anything has decimals, as will happen if you calculate a rate, it won’t run. Hence the need for the exposure variable.

Thanks for the response Karen. I’m running Stata 12 MP.

From your earlier response to a comment, it looks like any IVs about my patient population (the exposure) would have to be expressed as rates. My initial thought was that I could include the count of adult females instead of the % of adult females, because the exposure would divide the count and thereby create the rate itself. But this seems to be incorrect?

Also, are there any resources you would suggest consulting regarding exposure for count models? I’ve searched online but haven’t found much. Thanks for your time!

Evan

Hi Evan,

One book I really like on Poisson Regression in general is Scott Long’s book. The title is something like Regression Models for Categorical and Limited Dependent Variables. I know there is info there on the exposure variable, and it’s relatively non-math laden.

He also has a book on the same topic, but where everything is applied in Stata. I haven’t read that one.

Karen

Let me start by saying that I think your site is fantastic. When using an offset, it appears that it is only applied to the dependent variable, correct? I am curious whether the same offset variable used to convert the dependent variable to a rate is also used to scale the independent variables, or if this must be done another way. Thanks.

Hi Bryan,

First, thanks for the kind words.

Yes. The offset is really only offsetting the DV. If the IVs are also rates, you’d have to express them in terms of those rates.

Best,

Karen

Thanks for your answer.

I can express them as rates by dividing the numbers recorded (for example, kg of fruit) by the area sampled for each, then taking the log of each and using them as covariates. But this still doesn’t really reflect the different areas sampled and how much each subject (fruit transects of different lengths) ought to contribute overall. Do you know of a way to do this in SPSS? Using the GEE scale weight is not an option because that, too, applies only to the dependent variable. The standard "Weight cases" option also wouldn’t work, because that would weight all the variables and cause problems with the offset.

Thanks again for your help, I am really glad that I came across your site and I plan to use you when I need consultation help in the future.

Bryan

Thanks so much! After reading this, I better understand the Poisson models that I am running on my data (comparing mortality rates during exposure time intervals vs. non-exposure time intervals). In medical school, we don’t receive great training on interpreting the stats behind the research we consume. If we don’t understand the basic assumptions underlying the stats, I don’t know that we can properly interpret anything we read. This is a great contribution!

Thanks, Tarak. Glad I could help!

Karen
