Many types of outcome variables look like they should work in linear models (specifically, OLS regression and ANOVA models), but don’t.
They include discrete counts; truncated or censored variables, where part of the distribution is cut off or measured only up to a certain point; and bounded variables, like proportions and percentages.
This article outlines a particular type of outcome variable: one that measures whether or when an event occurs. These are typically called time-to-event variables, and they have a number of distinguishing characteristics that call for specialized statistical techniques.
These types of variables were first encountered in medical research. The analysis methods developed for them were called survival analysis, because the outcome of interest was often how long people survived; the time to event was the length of survival until death.
A common example would be a test of a potentially life-extending medical treatment, say a surgery for patients with a particular type of cancer. Upon diagnosis, patients would be randomly assigned to the standard treatment or to the standard treatment plus an experimental surgery. The outcome of the study would be how long they survived after diagnosis.
Survival analysis can compare the length of survival time between the standard and experimental treatment groups and control for covariates, such as age, sex, comorbidity (other illness), or other risk factors.
But it quickly became clear that survival analysis methods work equally well on event variables in a variety of fields:
– whether and how long before a marriage ends in divorce
– whether and how long before a foster child becomes adopted
– whether and how long before a plant becomes diseased
– whether and how long before a PhD student finishes a dissertation
Time-to-event outcomes have common characteristics, some of which make linear models untenable:
1. The main outcome measures the likelihood that a specific event occurs, given that it has not already occurred. This event is usually something that takes the individual from one state to another, and the research question is about how predictor variables relate to the propensity for the event to occur.
2. Every case in the data set must be eligible for the event to occur at the beginning of measurement.
3. Time of occurrence has to be well-measured and the beginning of time well-defined. Time can be measured by age or from the occurrence of some event that creates eligibility. Marriages become eligible for divorce only once the marriage starts. But a PhD student’s time to dissertation completion could start when they enter graduate school or when they advance to candidacy. Which starting point is used depends on the study aims and the available data.
4. It’s not a requirement, but it’s common for the event never to occur for some cases, and the analysis needs to take that into account. Not all marriages end in divorce, and not all graduate students finish their dissertations. (This is one thing that really messes it up for linear models.)
In some studies, the event is actually quite rare, and never occurs for most cases.
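The conditional likelihood described in item 1 is usually formalized as the hazard. The notation below is standard survival-analysis notation rather than anything defined in this article: T is assumed to be a case’s time to event, h(t) the hazard, and S(t) the survival function.

```latex
% Discrete time: probability the event occurs in period t,
% given that it has not occurred in any earlier period
h(t) = P(T = t \mid T \ge t)

% Continuous time: instantaneous rate of occurrence at time t,
% among cases still at risk at t
h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}

% Survival function: probability the event has not yet occurred by time t
S(t) = P(T > t)
```

The conditioning on T ≥ t is what item 2 is about: only cases still eligible for the event contribute to the hazard at time t.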
When the event is not observed for a case, that case is said to be censored. Censoring essentially means that we have incomplete information about the full length of time to event. Two types of censoring occur commonly in time-to-event studies. In one type, the event has not occurred by the end of the study. For all of these cases, all that is known about time to event is that it is greater than the length of the study.
In a 20-year divorce study, some marriages remain intact at the end of the 20 years. Some of those marriages will end in divorce after the 20-year mark, say at year 22, but for others it will never occur.
But it’s also possible for cases to become censored before the end of the study. If a marriage ends because one partner dies, or if a couple drops out of the study early because they moved to another state, we have only partial information: we know the marriage had not ended in divorce up to that point, but not whether or when it would have.
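One way to picture how censored cases are handled is to record each case as a pair: the time it was observed, and whether the event was seen. The sketch below is illustrative only. The `Marriage` class, its field names, and the five example marriages are all hypothetical, and the Kaplan-Meier estimator is one common survival-analysis technique chosen for the illustration, not a method this article prescribes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

STUDY_LENGTH = 20.0  # years of follow-up, as in the 20-year divorce example


@dataclass
class Marriage:
    # Hypothetical field names, invented for this illustration.
    divorce_year: Optional[float] = None  # years from wedding to divorce, if any
    dropout_year: Optional[float] = None  # years until death or loss to follow-up


def encode(m: Marriage, study_length: float = STUDY_LENGTH) -> Tuple[float, bool]:
    """Return (duration, event_observed) for one marriage."""
    # Observation stops at dropout or at the end of the study, whichever is first.
    last_seen = study_length
    if m.dropout_year is not None:
        last_seen = min(m.dropout_year, study_length)
    if m.divorce_year is not None and m.divorce_year <= last_seen:
        return (m.divorce_year, True)   # divorce observed during follow-up
    return (last_seen, False)           # censored: time to divorce exceeds last_seen


def kaplan_meier(observations: List[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Return [(time, estimated probability of surviving past time)]."""
    event_times = sorted({t for t, observed in observations if observed})
    survival = 1.0
    curve = []
    for t in event_times:
        # Censored cases count as at risk up to the time they were last seen.
        at_risk = sum(1 for d, _ in observations if d >= t)
        events = sum(1 for d, observed in observations if observed and d == t)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


marriages = [
    Marriage(divorce_year=3),    # divorce observed at year 3
    Marriage(divorce_year=8),    # divorce observed at year 8
    Marriage(dropout_year=5),    # moved away at year 5: censored early
    Marriage(divorce_year=22),   # divorces after the study ends: censored at 20
    Marriage(),                  # never divorces: censored at 20
]
observations = [encode(m) for m in marriages]
curve = kaplan_meier(observations)
print(observations)
print(curve)
```

In this toy data the year-22 divorce and the marriage that never ends are indistinguishable to the analyst: both are recorded as censored at 20 years. The couple that moved away at year 5 still counts in the risk set for the year-3 divorce but not for the year-8 one, which is how partial information gets used rather than discarded.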
Survival Analysis and Event History Analysis
There are also many variations in design and variable measurement within survival analysis that determine the exact statistical method to use. One example is whether time is measured continuously or at discrete intervals.
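When time is measured in discrete intervals, a common approach is to restructure the data so that each case contributes one row per interval in which it was still at risk. The sketch below assumes that layout; the function names, column names, and the four example cases are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def to_person_period(case_id: str, duration: int, event_observed: bool) -> List[dict]:
    """Expand one case into one row per discrete period at risk."""
    rows = []
    for period in range(1, duration + 1):
        is_last = period == duration
        rows.append({
            "id": case_id,
            "period": period,
            # The event flag is 1 only in the final period, and only if the
            # event was actually observed (not censored).
            "event": int(is_last and event_observed),
        })
    return rows


def period_event_rate(rows: List[dict]) -> Dict[int, float]:
    """For each period, the share of at-risk cases in which the event occurred."""
    at_risk: Counter = Counter()
    events: Counter = Counter()
    for row in rows:
        at_risk[row["period"]] += 1
        events[row["period"]] += row["event"]
    return {p: events[p] / at_risk[p] for p in sorted(at_risk)}


# Hypothetical cases: (id, number of periods observed, event observed?)
cases = [("A", 3, True), ("B", 2, False), ("C", 1, True), ("D", 3, False)]
rows = [r for case in cases for r in to_person_period(*case)]
rates = period_event_rate(rows)
print(rates)  # {1: 0.25, 2: 0.0, 3: 0.5}
```

Case B is censored after period 2, so it appears in the denominators for periods 1 and 2 and then simply stops contributing rows. With continuously measured time there is no natural set of periods to expand into, which is one reason the two situations lead to different methods.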
But there are also extensions beyond the medical survival studies that require more generalized models. These include events that can occur multiple times (e.g., incarceration, quitting smoking, or unemployment) and events with multiple possible outcomes (e.g., dropping out of school or graduating, versus staying in school). These more general event types extend survival analysis into a broader range of methods called event history analysis.