One of the most common—and one of the trickiest—challenges in data analysis is deciding how to include multiple predictors in a model, especially when they’re related to each other.
Here’s an example. Let’s say you are interested in studying work spillover into personal time as a predictor of job burnout.
You have 5 categorical yes/no variables that indicate whether a particular symptom of work spillover is present (see below).
While you could use each individual variable, you’re not really interested in whether any one in particular is related to the outcome. Perhaps it’s not really each symptom that’s important, but the idea that spillover is happening.
One possibility is to count up the number of items to which each respondent said yes. This variable will measure the degree to which spillover is happening. In many studies, this is just what you need.
But it doesn’t tell you something important: whether certain combinations of symptoms generally co-occur, and whether it is these combinations that affect burnout.
In other words, what if it’s not just the degree of spillover that’s important, but the type?
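A tiny sketch makes the limitation of the count concrete. The two respondents below are made up for illustration, and the column order (evenings, weekends, email within an hour, checks email, on call) is an assumption that simply follows the item list used in this post.

```python
# Hypothetical answers to the five spillover items (1 = yes, 0 = no).
# Assumed column order: evenings, weekends, email within an hour,
# checks email from home, on call during vacations.
respondent_a = [0, 0, 1, 1, 1]  # reachable by the company, no extra hours
respondent_b = [1, 1, 0, 1, 0]  # extra hours, but not on call

# The count score is just the number of items answered "yes".
count_a = sum(respondent_a)
count_b = sum(respondent_b)

print(count_a, count_b)              # 3 3
print(respondent_a == respondent_b)  # False
```

Both respondents get the same degree of spillover, yet their response patterns are quite different. The count alone cannot tell them apart.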
Enter Latent Class Analysis (LCA).
LCA is a measurement model in which individuals can be classified into mutually exclusive and exhaustive types, or latent classes, based on their pattern of answers on a set of categorical indicator variables. (Factor Analysis is also a measurement model, but with continuous indicator variables.)
Probability of ‘Yes’ response for each Class

| Item | Class 1 (20%) | Class 2 (61%) | Class 3 (12%) | Class 4 (7%) |
|---|---|---|---|---|
| Regularly brings home work to work on in the evenings | .30 | .08 | 1.0 | .66 |
| Is asked to work weekends to meet deadlines | .10 | .03 | .47 | 1.0 |
| Is expected to answer emails from the office within an hour outside of working hours | .93 | .04 | .15 | .96 |
| Checks work email from home | .84 | .45 | .91 | .94 |
| Is expected to be on call during vacations | .66 | .15 | .06 | .88 |
True class membership is unknown for each individual. As categories of a latent variable, these classes can’t be directly measured other than through the patterns of responses on the indicator variables.
There are two sets of parameters in an LCA. The first is the set of inclusion probabilities: the probability that a randomly chosen person belongs to each latent class. You can see in the example above that there are 4 classes, and that 20% of respondents are in Class 1, 61% are in Class 2, and so on.
The numbers in the body of the table are the second type of parameters, analogous to factor loadings in confirmatory factor analysis. Each is the conditional probability that someone in a particular class would respond ‘yes’ to a certain item. These parameters are used to interpret the classes.
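Here is a rough sketch of how the two sets of parameters work together. It takes the class sizes and conditional probabilities from the table above and uses Bayes’ rule to compute how likely each class is for a given pattern of answers. It relies on the standard LCA assumption that items are independent within a class (local independence), which this post does not spell out. The function names are made up for illustration.

```python
# Class sizes and conditional 'yes' probabilities copied from the table.
class_sizes = [0.20, 0.61, 0.12, 0.07]

# Rows are items in table order; columns are Class 1 to Class 4.
p_yes = [
    [0.30, 0.08, 1.00, 0.66],  # brings home work in the evenings
    [0.10, 0.03, 0.47, 1.00],  # asked to work weekends
    [0.93, 0.04, 0.15, 0.96],  # expected to answer emails within an hour
    [0.84, 0.45, 0.91, 0.94],  # checks work email from home
    [0.66, 0.15, 0.06, 0.88],  # expected to be on call during vacations
]


def posterior(pattern):
    """Return P(class | pattern) for a list of five 0/1 answers."""
    joint = []
    for c, size in enumerate(class_sizes):
        likelihood = 1.0
        for item, answer in enumerate(pattern):
            p = p_yes[item][c]
            # Local independence: multiply the item probabilities.
            likelihood *= p if answer == 1 else 1.0 - p
        joint.append(size * likelihood)
    total = sum(joint)
    return [j / total for j in joint]


def most_likely_class(pattern):
    """Return the 1-based label of the class with the highest posterior."""
    probs = posterior(pattern)
    return probs.index(max(probs)) + 1


print(most_likely_class([0, 0, 1, 1, 1]))  # 1
print(most_likely_class([1, 1, 0, 1, 0]))  # 3
```

Class membership stays uncertain for any one person. The model only gives a probability for each class, and assigning someone to their most likely class is one common convention rather than the only option.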
For example, the largest class, Class 2, might be interpreted as the “Low Spillover” group. Their probability of answering ‘yes’ to any of the 5 questions is relatively low. The only one that is a little bit high is ‘Checks work email from home,’ but even so, this group does this at the lowest probability of any of the classes.
Likewise, Class 4, the smallest, has a pretty high probability of answering ‘yes’ to every single question. This class would be the “High Spillover” group.
So far, it’s not very interesting, right? It just seems like a matter of degree.
But Classes 1 and 3 are more interesting.
Class 1 has pretty high probabilities of answering ‘yes’ to three of the questions and very low probabilities of answering ‘yes’ to the other two. If you examine what they’re saying yes to, they’re all about being available to the company outside of work hours. So their personal lives are often interrupted, but they’re not regularly working long hours.
Compare this to Class 3, which is quite different. Members of Class 3 are highly likely to check work email from home, but they’re also regularly putting in extra work in the evenings and, to a lesser extent, on weekends. They’re not expected to be at the beck and call of work, however. (Maybe they’re the ones in the office working late.)
These are two qualitatively different ways of having work spill into home life, and they could have different impacts on burnout. This is how Latent Class Analysis can be so useful.
In this example, we were able to use Latent Class Analysis to identify a latent typology that is used as a predictor variable, but there are many other uses within statistics, too.
So be sure to keep LCA on your radar—you never know when it might come in handy.
Hi, I would like to estimate the different factors influencing child restraint use. I have two age groups of children, 0 to 3 years and 4 to 7 years. My independent variables are driver age, race, gender, vehicle type, etc. Can I run an LCM model using the children group as a latent class? Do you think I am getting it conceptually right? Please suggest. Thank you very much!
Hi Meghna,
I don’t think so. The latent classes are groupings you don’t actually know, but are inferring from patterns in the data. I assume you have observed data on the child’s age?
Interesting. How do we determine the classes? Do we use Maximum Likelihood?
Hi Larry,
Yes, that’s what the LCA is doing. Finding the classes. And yes, it uses Maximum Likelihood to do so.
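To make the reply above a little more concrete, here is a minimal sketch of maximum likelihood estimation for yes/no items using the EM algorithm, which is one common way to fit these models. It is not necessarily what any particular package does. The function names, the simulated data, and the "true" parameter values are all made up for illustration, and the code assumes items are independent within a class.

```python
import math
import random


def fit_lca(data, n_classes, n_iter=100, seed=0):
    """Fit a latent class model to 0/1 data with a basic EM loop."""
    rng = random.Random(seed)
    n_items = len(data[0])
    sizes = [1.0 / n_classes] * n_classes
    # Random starting values for P(yes | class), kept away from 0 and 1.
    p_yes = [[rng.uniform(0.25, 0.75) for _ in range(n_items)]
             for _ in range(n_classes)]
    log_liks = []
    for _ in range(n_iter):
        # E-step: posterior class probabilities for each respondent.
        post = []
        log_lik = 0.0
        for row in data:
            joint = []
            for c in range(n_classes):
                lik = sizes[c]
                for j, y in enumerate(row):
                    lik *= p_yes[c][j] if y else 1.0 - p_yes[c][j]
                joint.append(lik)
            total = sum(joint)
            log_lik += math.log(total)
            post.append([v / total for v in joint])
        log_liks.append(log_lik)
        # M-step: update class sizes and conditional probabilities.
        for c in range(n_classes):
            weight = sum(p[c] for p in post)
            sizes[c] = weight / len(data)
            for j in range(n_items):
                yes = sum(p[c] * row[j] for p, row in zip(post, data))
                # Clamp so a probability never becomes exactly 0 or 1.
                p_yes[c][j] = min(max(yes / weight, 1e-6), 1.0 - 1e-6)
    return sizes, p_yes, log_liks


def simulate(n, sizes, p_yes, seed=1):
    """Draw n respondents from a known latent class model."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        c = rng.choices(range(len(sizes)), weights=sizes)[0]
        data.append([1 if rng.random() < p else 0 for p in p_yes[c]])
    return data


# Made-up two-class example: a low-spillover and a high-spillover class.
true_sizes = [0.6, 0.4]
true_p = [[0.1, 0.1, 0.2, 0.1, 0.2], [0.9, 0.8, 0.9, 0.8, 0.9]]
data = simulate(500, true_sizes, true_p)

sizes, p_yes, log_liks = fit_lca(data, n_classes=2)
print([round(s, 2) for s in sizes])
```

The number of classes is chosen by the analyst, usually by comparing fits with different numbers of classes. In practice people also run several random starts, because EM can settle on a local maximum and the order of the class labels is arbitrary.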
I have card sort data and I want to run an LCA to determine classes of people who grouped cards similarly. How would I go about doing that?
Our survey is all about learning facilities and environment. These are the choices that will be used. Is this a 4-point or 5-point Likert scale?
4 - Very satisfied, 3 - Satisfied, 2 - Slightly satisfied, 1 - Not satisfied, 0 - No experience in the facility
Would I solve for 0 too?
Ella, It’s not really a 5-point scale b/c 0 isn’t part of the ordering. It’s a qualitatively different category. So this variable isn’t entirely ordinal, but it certainly is categorical, so you can definitely use LCA on it.
Can Latent Class Analysis be done using SPSS Statistics 23?
Not version 23. I know Stata, R, Mplus, and SAS can all do it.