Whenever I get email questions whose answers I think would benefit others, I like to answer them here. I leave out the asker’s name for privacy, but this is a great question about dummy coding:
First of all, thanks for all those helpful information you provided! Thanks sincerely for all your efforts!
Actually I am here to ask a technical question. See, I have 6 locations (let’s say A, B, C, D, E, and F), and I want to see the location effect on the outcome using OLS models.
I know that if I included 5 dummy location variables (6 locations in total, with A as the reference group) in 1 block of the regression analysis, the result would be based on the comparison with the reference location.
Then what if I put 6 dummies (for example, the 1st dummy would be “1” for A location, and “0” for otherwise) in 1 block? Will it be a bug? If not, how to interpret the result?
Thanks a lot!
If you put in a 6th dummy code for Location A, your reference group, the model will actually blow up. (Yes, that’s a technical term).
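This is a case of perfect multicollinearity. Assuming the model includes an intercept, as OLS models usually do, the 6 dummies add up to exactly 1 for every observation, which is the same as the intercept column. So the model can’t be estimated uniquely.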
It’s the same situation you learned back in Algebra where you have one equation and two unknowns. The problem isn’t that it can’t be solved. The problem is that there are an infinite number of equally good solutions.
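If you’d like to see this for yourself, here is a minimal sketch in Python with NumPy. The data are made up (six locations repeated four times) and the coefficient values are arbitrary numbers I picked for illustration, not estimates from any real study. It checks the rank of the design matrix with 5 and with 6 dummies, then shows two different sets of coefficients that produce exactly the same predictions.

```python
import numpy as np

# Made-up data: 24 observations spread evenly over 6 locations
levels = ["A", "B", "C", "D", "E", "F"]
locations = np.array(levels * 4)

# One 0/1 dummy column per location
dummies = np.column_stack([(locations == lev).astype(float) for lev in levels])
intercept = np.ones((len(locations), 1))

# Intercept + 5 dummies (A is the reference group)
X_five = np.hstack([intercept, dummies[:, 1:]])
# Intercept + all 6 dummies
X_six = np.hstack([intercept, dummies])

rank_five = np.linalg.matrix_rank(X_five)
rank_six = np.linalg.matrix_rank(X_six)
print("5 dummies:", X_five.shape[1], "columns, rank", rank_five)
print("6 dummies:", X_six.shape[1], "columns, rank", rank_six)

# Arbitrary coefficients: intercept first, then A through F
beta_1 = np.array([10.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0])

# Add any constant to the intercept and subtract it from every dummy
shift = 7.0
beta_2 = beta_1.copy()
beta_2[0] += shift
beta_2[1:] -= shift

# Different coefficients, identical predictions
same_fit = np.allclose(X_six @ beta_1, X_six @ beta_2)
print("Same predictions from different coefficients:", same_fit)
```

With 5 dummies the rank matches the number of columns. With 6 dummies there are 7 columns but the rank is still 6, and any value of shift gives another set of coefficients that fits just as well. That is the "infinite number of equally good solutions."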
If an observation falls in Location A, the reference group, we’ve already gotten that information from the other 5 dummy variables. That observation would have a 0 on all of them. So we already know its location is A. We don’t need another dummy variable to tell the model that. It’s redundant information. And so perfectly redundant that the model will choke.
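To make the redundancy concrete, here is a small Python sketch, again with made-up location labels. It rebuilds the Location A dummy from nothing but the other 5 dummies: an observation is in A exactly when it has a 0 on all of them.

```python
import numpy as np

# Made-up location labels for 7 observations
locations = np.array(["A", "C", "F", "B", "A", "E", "D"])

# The 5 dummies already in the model (B through F)
others = ["B", "C", "D", "E", "F"]
five_dummies = np.column_stack([(locations == lev).astype(int) for lev in others])

# The proposed 6th dummy for Location A
dummy_a = (locations == "A").astype(int)

# Rebuild it from the other 5: 1 minus their row sum
rebuilt_a = 1 - five_dummies.sum(axis=1)

matches = np.array_equal(dummy_a, rebuilt_a)
print("A dummy recoverable from the other five:", matches)
```

Because the 6th dummy can be computed exactly from the other 5, it brings no new information to the model.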
Dummy coding is one of the topics I get the most questions about. It can get especially tricky to interpret when the dummy variables are also used in interactions, so I’ve created some resources that really dig in deeply.