When our outcome variable is the frequency of occurrence of an event, we will typically use a count model to analyze the results. There are numerous count models. A few examples are: Poisson, negative binomial, zero-inflated Poisson and truncated negative binomial.
Which count model to use depends on specific requirements, and the models are not interchangeable. But regardless of the model we use, they all share one very important prerequisite.
We must identify the period of time or area of space in which the counts were generated.
The term used for modeling the period of time or area of space is exposure. The exposure variable modifies each observation from a count into a rate per period of time or area.
For example, if you were to count birds at various locations, you would need to know the size of the area in which each count was made. Ten birds counted within 100 square feet represent a much higher density than 10 birds counted within 625 square feet.
Similarly, births counted during the month of February (28 days) cover a shorter length of time than births counted during the month of January (31 days).
If we don’t take into account the different exposures, we will have biased results.
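As a quick illustration of the bird and birth examples, here is a minimal sketch that turns counts into rates by dividing by the exposure. The bird numbers come from the example above. The birth count of 310 is a made-up value used only to show the effect of month length, and the `rate` helper is a hypothetical name.

```python
# Convert raw counts into rates by dividing by the exposure.

def rate(count, exposure):
    """Return the count per unit of exposure (area, time, person-years, ...)."""
    if exposure <= 0:
        raise ValueError("exposure must be positive")
    return count / exposure

# Bird example from the text: the same count over two different areas.
birds_small_plot = rate(10, 100)  # birds per square foot
birds_large_plot = rate(10, 625)  # birds per square foot

# Hypothetical birth count (not from the article): same count, different month lengths.
births_feb = rate(310, 28)  # births per day in February
births_jan = rate(310, 31)  # births per day in January

print(f"{birds_small_plot:.3f} vs {birds_large_plot:.3f} birds per square foot")
print(f"{births_feb:.2f} vs {births_jan:.2f} births per day")
```

The raw counts are identical within each pair, yet the rates differ. That difference is what the exposure variable captures.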
An Exposure Variable Example
Let’s look at a model where the outcome is the number of deaths.
The predictors in the model are whether the deceased smoked and what age bracket they were in. The exponentiated model coefficients represent the incidence rate ratio (IRR) of the category compared to the base category.
The results tell us that smokers have a rate of death 6.24 times that of non-smokers when controlling for their age bracket.
We also find that 55- to 64-year-olds have a rate of death 6.88 times that of 35- to 44-year-olds.
Interestingly enough, we see that the rate of death for 75- to 84-year-olds is lower than for 55- to 64-year-olds when controlling for smoking. That doesn’t make sense.
Converting Counts to Rates
Question: Over what period or area were the outcomes measured? Were they measured over the same period of time and over the same size population?
It turns out they were not.
Each observation records the number of deaths over a given number of person-years. The data in this analysis were collected from English counties. They contain the number of smokers and non-smokers per age category and the number of deaths over a specific time period in each county.
As you can imagine, the number of people living in county A and county B will differ. So will the number of years over which each county was measured.
Including an exposure variable, such as person-years, allows the counts of deaths to be comparable. We don’t want to predict more deaths just because a county has more people or because we measured it for a longer period of time.
After including person-years in our model as the exposure variable, we get very different results.
The incidence rate ratio drops from 6.24 to 1.43 when comparing smokers to non-smokers. In addition, as age increases, the incidence rate ratio (as compared to the base category) increases. This intuitively makes sense.
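To see why an exposure can change the ratio so much, here is a minimal sketch using made-up numbers (these are not the data from this analysis). It relies on a known property of a Poisson model with a single binary predictor: without an exposure, the exponentiated coefficient equals the ratio of mean counts, and with an exposure, it equals the ratio of events per unit of exposure. The function names and data are hypothetical.

```python
# Hypothetical data, NOT the article's data set.
# Each tuple is (deaths, person_years) for one observation.
smokers = [(30, 10000), (90, 20000)]
non_smokers = [(5, 2000), (15, 5000)]

def count_ratio(group, reference):
    """Ratio of mean counts: what exp(coef) reports when exposure is ignored."""
    mean_group = sum(d for d, _ in group) / len(group)
    mean_ref = sum(d for d, _ in reference) / len(reference)
    return mean_group / mean_ref

def rate_ratio(group, reference):
    """Ratio of deaths per person-year: what exp(coef) reports with an exposure."""
    rate_group = sum(d for d, _ in group) / sum(t for _, t in group)
    rate_ref = sum(d for d, _ in reference) / sum(t for _, t in reference)
    return rate_group / rate_ref

print(f"Ratio ignoring exposure: {count_ratio(smokers, non_smokers):.2f}")
print(f"Ratio using person-years: {rate_ratio(smokers, non_smokers):.2f}")
```

In this toy data the smokers were observed for far more person-years, so most of the gap in raw counts comes from exposure rather than from a higher rate of death.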
Note: Some statistical software requires the analyst to include an “offset” variable rather than an “exposure” variable. If that is the case with your software, you will need to take the natural log of the exposure variable before including it in the model as the offset.
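The reason for the natural log is that count models such as Poisson typically use a log link, so the offset is added on the log scale with its coefficient fixed at 1. Here is a minimal sketch showing that the two approaches give the same expected count. The linear predictor and person-years values are made up for illustration, and the function names are hypothetical.

```python
import math

def expected_count_with_exposure(linear_predictor, exposure):
    """Expected count when the software accepts the exposure directly."""
    return exposure * math.exp(linear_predictor)

def expected_count_with_offset(linear_predictor, offset):
    """Expected count when the software expects an offset on the log scale."""
    return math.exp(linear_predictor + offset)

# Hypothetical values for illustration only.
linear_predictor = -5.5  # intercept plus coefficients times predictors
person_years = 12000.0

offset = math.log(person_years)  # natural log of the exposure
a = expected_count_with_exposure(linear_predictor, person_years)
b = expected_count_with_offset(linear_predictor, offset)

# Both routes describe the same model, so the expected counts agree.
print(math.isclose(a, b))
```

Passing the raw person-years as an offset without logging them would describe a different model and distort the estimates.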
This article was updated Nov. 19th, 2020.
Jeff Meyer is a statistical consultant with The Analysis Factor, a stats mentor for Statistically Speaking membership, and a workshop instructor.