by Jeff Meyer
In a previous post we explored bounded variables and the difference between truncated and censored data. Can we ignore the fact that a variable is bounded and just run our analysis as if the data weren’t bounded?
Issues that arise when analyzing truncated data
Count data, which consist of non-negative integers, are naturally bounded: you can’t have negative counts.
Poisson and negative binomial regression models are used to analyze count data. Collectively known as count models, they allow for zero counts. In fact, many data sets contain more zeros than the probability distribution predicts. This is called zero inflation. This isn’t always the case, however.
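As a rough illustration of what "more zeros than expected" means, the sketch below compares the observed share of zeros with the share a Poisson distribution with the same mean would predict, which is exp(-mean). The counts are made up for the example, and the function name `poisson_zero_gap` is hypothetical. It is a quick descriptive check under a plain Poisson assumption, not a formal test for zero inflation.

```python
import math

def poisson_zero_gap(counts):
    """Observed share of zeros minus the share implied by a Poisson with the same mean."""
    n = len(counts)
    mean = sum(counts) / n
    observed = sum(1 for c in counts if c == 0) / n
    expected = math.exp(-mean)  # P(Y = 0) for a Poisson with this mean
    return observed - expected

# Hypothetical data: 60 zeros plus ten each of 1, 2, 3 and 4 (mean = 1.0).
counts = [0] * 60 + [1] * 10 + [2] * 10 + [3] * 10 + [4] * 10
gap = poisson_zero_gap(counts)
print(f"observed minus expected share of zeros: {gap:.3f}")
```

A clearly positive gap suggests more zeros than a Poisson with that mean would produce. A negative binomial model implies a different expected share, so this check applies to the Poisson case only.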
Count data can also be truncated, usually at zero. Examples of count data that would be truncated at zero:
- The population is admitted hospital patients, and the researcher wants to model the predictors of the length of stay in the hospital. The study would include only those patients who spent at least a day in the hospital. Anyone with zero days in the hospital wouldn’t be in the hospital’s data records.
- A study wants to determine how many blogs a blogger will write based on various topics of interest. You can’t be a blogger if you don’t write at least one blog, so bloggers as a population are limited to those with at least one blog.
If we analyze zero-truncated data assuming a non-truncated probability distribution, the estimates will be biased, because the model reserves probability for zeros that can never be observed. The closer the mean is to zero, the greater the bias.
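To see how the bias grows as the mean approaches zero, here is a minimal sketch for the simplest case: a Poisson outcome with no predictors. Without zeros, the sample mean settles at λ / (1 − exp(−λ)) rather than λ, so an ordinary Poisson fit, whose estimate of λ is the sample mean, overshoots. The function names are hypothetical, and the bisection solver is just one simple way to undo the truncation in this intercept-only setting.

```python
import math

def zero_truncated_mean(lam):
    """Mean of a Poisson(lam) count given that the count is at least 1."""
    return lam / -math.expm1(-lam)  # lam / (1 - exp(-lam))

def naive_bias(lam):
    """Large-sample bias of the ordinary Poisson estimate when zeros are never observed."""
    return zero_truncated_mean(lam) - lam

def solve_truncated_lambda(sample_mean, iterations=200):
    """Recover lam from a zero-truncated sample mean (must exceed 1) by bisection."""
    lo, hi = 1e-9, sample_mean  # the root always lies below the truncated mean
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if zero_truncated_mean(mid) < sample_mean:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for lam in (0.5, 2.0, 10.0):
    observed = zero_truncated_mean(lam)
    corrected = solve_truncated_lambda(observed)
    print(f"true mean {lam:>4}: naive estimate {observed:.4f}, "
          f"bias {naive_bias(lam):.4f}, corrected {corrected:.4f}")
```

The bias is largest for the smallest mean and nearly vanishes when the mean is 10, where zeros would have been rare anyway.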
Non-count data (a continuous outcome variable) can also be truncated at zero.
For example, we might want to determine the impact of a number of predictors on wage income, such as education, marital status, region, and parents’ socioeconomic status.
If the data come from employment records, the sample will not contain stay-at-home moms or dads, who would report zero income.
Including the zero incomes of stay-at-home parents would reduce the mean income earned. Excluding those with zero income can lead to “selection bias,” and the results from an OLS regression will be biased if we are interested in the value of the outcome variable for the entire population.
Some academics, such as Christopher Baum in An Introduction to Modern Econometrics Using Stata, argue that “we cannot even use truncated data to make inferences about the subpopulation. A regression estimated from the subpopulation will yield coefficients that are biased toward zero as well as the estimate of the variance that is biased downward.”
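A small simulation can make the direction of these biases concrete. The sketch below is an illustration under made-up assumptions, not an analysis of real wage data: it generates an outcome y = x + error with standard normal x and error, fits OLS to the full sample, then refits after dropping every case with y at or below zero. The function names `ols_fit` and `simulate`, the sample size, and the seed are all hypothetical choices.

```python
import random

def ols_fit(xs, ys):
    """Simple linear regression: return (slope, residual variance)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return slope, rss / (n - 2)

def simulate(n=20000, seed=1):
    """Fit OLS on the full sample and on the sample truncated at y > 0."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [x + rng.gauss(0, 1) for x in xs]  # assumed true slope of 1, error variance of 1
    kept = [(x, y) for x, y in zip(xs, ys) if y > 0]  # truncation: y <= 0 never observed
    full = ols_fit(xs, ys)
    truncated = ols_fit([x for x, _ in kept], [y for _, y in kept])
    return full, truncated

full, truncated = simulate()
print(f"full sample:      slope {full[0]:.3f}, residual variance {full[1]:.3f}")
print(f"truncated sample: slope {truncated[0]:.3f}, residual variance {truncated[1]:.3f}")
```

In this setup the truncated fit gives a slope closer to zero and a smaller residual variance than the full-sample fit, which matches the direction of the biases Baum describes.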
Jeff Meyer is a statistical consultant with The Analysis Factor, a stats mentor for Statistically Speaking membership, and a workshop instructor. Read more about Jeff here.