Multicollinearity can affect any regression model with more than one predictor. It occurs when two or more predictor variables overlap so much in what they measure that their effects are indistinguishable.
When the model tries to estimate their unique effects, it goes wonky (yes, that’s a technical term).
For example, you may be interested in understanding the separate effects of altitude and temperature on the growth of a certain species of mountain tree. Since temperature drops predictably as altitude rises, the two predictors carry nearly the same information, and the model can't tell their effects apart.
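To see what "wonky" looks like in practice, here is a minimal sketch with made-up data: altitude and temperature are simulated to be almost perfectly (negatively) correlated, and the standard errors of their estimated coefficients blow up compared to what you'd get from either predictor alone. The variable names and effect sizes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

altitude = rng.normal(0, 1, n)
# temperature is almost entirely determined by altitude (r close to -1)
temperature = -altitude + rng.normal(0, 0.1, n)
# hypothetical true model: both predictors affect growth
growth = 2 * altitude + 1 * temperature + rng.normal(0, 1, n)

# ordinary least squares with both collinear predictors
X = np.column_stack([np.ones(n), altitude, temperature])
beta, *_ = np.linalg.lstsq(X, growth, rcond=None)
resid = growth - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# the coefficient standard errors for altitude and temperature are
# many times larger than they would be with either predictor alone
print("estimates:", np.round(beta[1:], 2))
print("std errors:", np.round(se[1:], 2))
```

The point estimates may still be roughly right on average, but with standard errors this inflated, individual fits can land far from the truth, and signs can even flip from sample to sample.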
One of the most common—and one of the trickiest—challenges in data analysis is deciding how to include multiple predictors in a model, especially when they’re related to each other.
Here’s an example. Let’s say you are studying work spillover into personal time as a predictor of job burnout.
You have 5 categorical yes/no variables that indicate whether a particular symptom of work spillover is present (see below).
While you could use each individual variable, you’re not really interested in whether any one symptom in particular is related to the outcome. Perhaps what matters is not each symptom on its own, but the fact that spillover is happening at all.
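One common way to capture "spillover is happening" without entering five correlated indicators separately is to combine them into a single count index. A minimal sketch, with hypothetical respondents and made-up symptom data:

```python
import numpy as np

# hypothetical data: one row per respondent, one column per
# yes/no spillover symptom (1 = present, 0 = absent)
symptoms = np.array([
    [1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
])

# count index: number of spillover symptoms each respondent reports
spillover_index = symptoms.sum(axis=1)
print(spillover_index)  # [3 0 5]
```

The single index can then be used as one predictor in the burnout model, sidestepping the redundancy among the individual yes/no variables. Whether a simple count is the right summary (versus, say, a weighted scale) is a substantive decision, not a statistical one.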
Multicollinearity occurs when two or more predictor variables in a regression model are redundant. It is a real problem, and it can do terrible things to your results. However, the dangers of multicollinearity seem to have been so drummed into students’ minds that they have created a panic.
True multicollinearity (the kind that messes things up) is pretty uncommon. High correlations among predictor variables may suggest multicollinearity, but they are NOT a reliable indicator that it exists, and they do not necessarily signal a problem. How high is too high depends on (more…)
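A more direct diagnostic than pairwise correlations is the variance inflation factor (VIF), which measures how much each predictor is explained by all the others together. A minimal sketch using NumPy and simulated data (the cutoffs and variable names are illustrative, not rules): VIFs can be read off the diagonal of the inverse correlation matrix of the predictors.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)  # correlated with x1 (r near 0.8)
x3 = rng.normal(size=n)                   # independent of the others

X = np.column_stack([x1, x2, x3])

# VIF for each predictor = diagonal of the inverse correlation matrix;
# equivalently 1 / (1 - R^2) from regressing that predictor on the rest
vif = np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))
print(np.round(vif, 2))
```

Here x1 and x2 show moderately elevated VIFs while x3 stays near 1, even though the x1–x2 correlation of 0.8 would alarm anyone screening a correlation matrix. That is the point: a high pairwise correlation inflates variance somewhat, but it takes near-redundancy (VIFs in the double digits, by many rules of thumb) before estimates truly fall apart.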