When our research question is focused on the frequency of occurrence of an event, we will typically use a count model to analyze the results. There are numerous count models. A few examples are: Poisson, negative binomial, zero-inflated Poisson and truncated negative binomial.
There are specific requirements as to which count model to use. The models are not interchangeable. But regardless of the model we use, there is a very important prerequisite that they all share.
We must identify the period of time or area of space in which the counts were generated.
The term used for the period of time or area of space over which the counts are observed is exposure. The exposure variable adjusts each observation so that the model describes a rate (the count per unit of time or area) rather than a raw count.
For example, if you were to count birds at various locations, you would need to know the area of space in which you are doing the count. Ten birds counted within 100 square feet is a higher density than 10 birds counted within 625 square feet.
Similarly, the number of births counted during the month of February (28 days) covers a shorter length of time than the number of births counted during the month of January (31 days).
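As a quick illustration, the bird counts above can be put on a common scale by dividing each count by its area. This is a minimal sketch; the site names are made up for the example.

```python
# Counts and areas come from the bird example in the text;
# the site names are hypothetical labels.
sites = {
    "site_a": {"birds": 10, "area_sq_ft": 100},
    "site_b": {"birds": 10, "area_sq_ft": 625},
}

# Dividing by the exposure (area) turns raw counts into comparable rates.
density = {name: s["birds"] / s["area_sq_ft"] for name, s in sites.items()}

for name, d in density.items():
    print(f"{name}: {d:.3f} birds per square foot")
# site_a: 0.100 birds per square foot
# site_b: 0.016 birds per square foot
```

The raw counts are identical, but the rates differ by a factor of more than six.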
If we don’t take into account the different exposures for the observations within our data, we will have biased results due to some observations having higher or lower non-normalized counts.
Let’s look at a model where the outcome is the number of deaths.
The predictors in the model are whether the deceased smoked and what age bracket they were in. The coefficients of the model represent the incidence rate ratio (IRR) of the stated category as compared to the base category for that categorical variable.
The results tell us that smokers have a rate of death 6.24 times greater than non-smokers when controlling for their age bracket.
We also find out that people who are in the 55- to 64-year-old age bracket have a rate of death that is 6.88 times more than those in the 35- to 44-year-old bracket.
Interestingly enough, the results show that the rate of death for those in the 75- to 84-year-old bracket is lower than for those in the 55- to 64-year-old bracket when controlling for smoking. I would have to think that doesn’t make sense.
Question: Over what period or area were the outcomes measured? Were they measured over the same period of time and over the same size population?
It turns out they were not.
Each observation measures the number of deaths over a given number of person-years. The data in this analysis was collected from English counties. The number of smokers and non-smokers in each of five age categories living within the county, as well as the number of deaths, was counted over specific periods of time.
As you can imagine, the number of people living in county A is going to be different than the number in county B. In addition, not every county was measured for the same number of years.
Including an exposure variable for the total number of people observed, such as person-years, allows the counts of deaths to be comparable. We don’t want to be predicting more deaths just because there are more people in a county or because it was measured for a longer period of time.
After including person-years in our model as the exposure variable, we get very different results.
The incidence rate ratio drops from 6.24 to 1.43 when comparing smokers to non-smokers. In addition, as age increases, the incidence rate ratio (as compared to the base category) increases. This intuitively makes sense.
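To see how ignoring exposure can inflate a ratio like this, here is a minimal sketch using invented numbers (not the data from this analysis). It relies on the fact that in a Poisson model with a single binary predictor, the estimated IRR equals the ratio of the two groups' crude rates. The model above also adjusts for age bracket, so this is a simplification.

```python
# Hypothetical records, invented for illustration only:
# (group, deaths, person_years)
records = [
    ("smoker", 30, 10000),
    ("smoker", 60, 20000),
    ("non_smoker", 5, 2500),
    ("non_smoker", 10, 5000),
]

def totals(group):
    """Return total deaths, total person-years, and row count for a group."""
    rows = [r for r in records if r[0] == group]
    deaths = sum(r[1] for r in rows)
    person_years = sum(r[2] for r in rows)
    return deaths, person_years, len(rows)

d_s, py_s, n_s = totals("smoker")
d_n, py_n, n_n = totals("non_smoker")

# Without exposure: compare the average count per observation.
ratio_no_exposure = (d_s / n_s) / (d_n / n_n)

# With exposure: compare deaths per person-year.
irr_with_exposure = (d_s / py_s) / (d_n / py_n)

print(f"Ratio ignoring exposure: {ratio_no_exposure:.2f}")  # 6.00
print(f"IRR using person-years:  {irr_with_exposure:.2f}")  # 1.50
```

In this made-up example the smoker rows simply cover more person-years, so the unadjusted ratio mostly reflects the amount of observation rather than the rate of death.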
Note: Some statistical software requires the analyst to include the “offset” variable rather than the “exposure” variable. If that is the case with your software, you will need to take the natural log of the variable in order to include it in the model.
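The reason for the natural log can be sketched for a count model with a log link. The notation here is ours and is not tied to any particular software: mu_i is the expected count for observation i, t_i is its exposure, and x_i'β is the linear predictor.

```latex
\begin{align*}
\log\!\left(\frac{\mu_i}{t_i}\right) &= x_i^\top \beta \\
\log \mu_i &= x_i^\top \beta + \log t_i
\end{align*}
```

Modeling the rate on the log scale is the same as modeling the count with log(t_i) added to the linear predictor, with its coefficient fixed at 1. That added term is the offset, which is why the exposure must be logged before it is entered as an offset.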
Jeff Meyer is a statistical consultant with The Analysis Factor, a stats mentor for Statistically Speaking membership, and a workshop instructor.