Multicollinearity can affect any regression model with more than one predictor. It occurs when two or more predictor variables overlap so much in what they measure that their effects are indistinguishable.
When the model tries to estimate their unique effects, it goes wonky (yes, that’s a technical term).
So for example, you may be interested in understanding the separate effects of altitude and temperature on the growth of a certain species of mountain tree.
Altitude and temperature are distinct concepts, but the mean temperature is so correlated with the altitude at which the tree is growing that there is no way to separate out their effects.
But it’s not always easy to tell that the wonkiness in your model comes from multicollinearity.
One popular detection method is based on the bivariate correlation between two predictor variables. If it’s above .8 (or .7 or .9 or some other high number), the rule of thumb says you have multicollinearity.
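Checking that rule of thumb takes one line with NumPy. Here's a minimal sketch using simulated altitude and temperature data (the numbers are made up purely for illustration):

```python
import numpy as np

# Simulated data: temperature tracks altitude closely (illustrative only)
rng = np.random.default_rng(0)
n = 200
altitude = rng.normal(size=n)
temperature = -0.95 * altitude + 0.2 * rng.normal(size=n)

# Bivariate correlation between the two predictors
r = np.corrcoef(altitude, temperature)[0, 1]
print(f"correlation: {r:.2f}")  # well past the usual .8 rule of thumb
```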
And it is certainly true that a high correlation between two predictors is an indicator of multicollinearity. But there are two problems with treating this rule of thumb as a rule.
First, how high that correlation has to be before you see inflated variances depends on the sample size. There is no single good cutoff number.
Second, it’s possible that while no two variables are highly correlated, three or more together are multicollinear. Weird idea, I know. But it happens.
You’ll completely miss the multicollinearity in that situation if you’re just looking at bivariate correlations.
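Here's a small simulated example of that situation: three independent predictors plus a fourth that is nearly their sum. No pairwise correlation looks alarming, but regressing the fourth on the other three exposes the redundancy (data and coefficients are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1, x2, x3 = rng.normal(size=(3, n))
# x4 is almost an exact sum of the others: jointly collinear,
# yet no single pairwise correlation looks alarming
x4 = x1 + x2 + x3 + 0.3 * rng.normal(size=n)

X = np.column_stack([x1, x2, x3, x4])
corr = np.corrcoef(X, rowvar=False)
max_pairwise = np.abs(corr - np.eye(4)).max()
print(f"largest pairwise |r|: {max_pairwise:.2f}")  # modest-looking

# R^2 from regressing x4 on the other three reveals the problem
A = np.column_stack([np.ones(n), x1, x2, x3])
resid = x4 - A @ np.linalg.lstsq(A, x4, rcond=None)[0]
r2 = 1 - resid.var() / x4.var()
print(f"R^2 of x4 on the others: {r2:.3f}")  # very close to 1
```

The bivariate screen passes while the multivariate one fails, which is exactly the blind spot described above.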
So like a lot of things in statistics, when you’re checking for multicollinearity, you have to check multiple indicators and look for patterns among them. Sometimes just one is all it takes and sometimes you need to see patterns among a few.
Here are seven more indicators of multicollinearity.
1. Very high standard errors for regression coefficients
When standard errors are orders of magnitude higher than their coefficients, that’s an indicator.
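You can see this inflation directly by computing the standard errors yourself. A minimal NumPy sketch with two nearly identical simulated predictors (all names and data here are made up for illustration):

```python
import numpy as np

def ols_se(X, y):
    """OLS coefficients and their standard errors, intercept included."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (len(y) - A.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
    return beta, se

rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)  # nearly identical to x1
y = x1 + rng.normal(size=n)

beta, se = ols_se(np.column_stack([x1, x2]), y)
# The two slope estimates are unstable, and their standard errors
# are far larger than they would be with independent predictors
print("slopes:", beta[1:], "standard errors:", se[1:])
```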
2. The overall model is significant, but none of the coefficients are
Remember that a p-value for a coefficient tests whether the unique effect of that predictor on Y is zero. If all predictors overlap in what they measure, there is little unique effect, even if the predictors as a group have an effect on Y.
3. Large changes in coefficients when adding predictors
If the predictors are completely independent of each other, their coefficients won’t change at all when you add or remove one. But the more they overlap, the more drastically their coefficients will change.
4. Coefficients have signs opposite what you’d expect from theory
Be careful here: you don't want to automatically write off an unexpected finding as a problem. Not all effects that run opposite to theory indicate something wrong with the model. That said, it could be multicollinearity, and it warrants a second look at the other indicators.
5. Coefficients on different samples are wildly different
If you have a large enough sample, split the sample in half and run the model separately on each half. Wildly different coefficients in the two models could be a sign of multicollinearity.
6. High Variance Inflation Factor (VIF) and Low Tolerance
These two useful statistics are reciprocals of each other, so either a high VIF or a low tolerance is indicative of multicollinearity. Tolerance is 1 − R² from regressing that predictor on all the other predictors, and VIF is a direct measure of how much the variance of the coefficient (the square of its standard error) is being inflated due to multicollinearity.
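VIF is straightforward to compute by hand from that definition: regress each predictor on the rest and take 1 / (1 − R²). A minimal sketch on simulated data (illustrative only):

```python
import numpy as np

def vif(X):
    """VIF for each column of X: 1 / (1 - R^2) from regressing that
    column on all the other columns (intercept included)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1 - resid.var() / X[:, j].var()
        out[j] = 1 / (1 - r2)
    return out

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)  # nearly redundant with x1
x3 = rng.normal(size=n)             # independent of the others

v = vif(np.column_stack([x1, x2, x3]))
print("VIF:", v.round(1))           # huge for x1 and x2, ~1 for x3
print("tolerance:", (1 / v).round(3))
```

Common rules of thumb flag VIF above 5 or 10 (tolerance below .2 or .1), though, as with the correlation cutoff, no single threshold fits every situation.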
7. High Condition Indices
Condition indices are a bit strange. The basic idea is to run a Principal Components Analysis on all the predictors. If they share a lot of information, the variance of the first principal component (its eigenvalue) will be much larger than the variance of the last. The condition index, the square root of the ratio of the largest eigenvalue to the smallest, will be high if multicollinearity is present.
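One simple version of this idea takes the eigenvalues of the predictors' correlation matrix (textbook treatments often instead use singular values of the scaled design matrix with the intercept included, so software may report slightly different numbers). A sketch on simulated data, with all values made up for illustration:

```python
import numpy as np

def condition_index(X):
    """Largest condition index: sqrt of the ratio of the largest to the
    smallest eigenvalue of the predictors' correlation matrix."""
    eig = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))
    return np.sqrt(eig.max() / eig.min())

rng = np.random.default_rng(5)
n = 200
x1, x3 = rng.normal(size=(2, n))
x2 = x1 + 0.05 * rng.normal(size=n)  # near-duplicate of x1

ci_bad = condition_index(np.column_stack([x1, x2, x3]))
ci_ok = condition_index(np.column_stack([x1, x3]))
print(f"collinear set: {ci_bad:.1f}, independent set: {ci_ok:.1f}")
```

The collinear set produces a large index while the independent pair stays near 1; a common rule of thumb treats condition indices above roughly 30 as a sign of serious multicollinearity.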