Last week I had the pleasure of teaching a webinar on Interpreting Regression Coefficients. We walked through the output of a somewhat tricky regression model—it included two dummy-coded categorical variables, a covariate, and a few interactions.

As always seems to happen, our audience asked an amazing number of great questions. (Seriously, I’ve had multiple guest instructors compliment me on our audience and their thoughtful questions.)

We had so many that although I spent about 40 minutes answering (more…)

A data set can contain indicator (dummy) variables, categorical variables, or both. Which type a given variable is depends initially on how the data is coded.

For example, a categorical variable like marital status could be coded in the data set as a single variable with 5 values: (more…)

In the last post, we examined how to use the same sample when running a set of regression models with different predictors.

Adding a predictor with missing data causes cases that had been included in previous models to be dropped from the new model.

Using different samples in different models can lead to very different conclusions when interpreting results.
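To make the dropped-case problem concrete, here is a minimal Python sketch (not the Stata workflow this post goes on to use). The rows, the variable names `outcome`, `attend`, and `income`, and the helper `complete_cases` are all hypothetical, invented purely for illustration.

```python
# Hypothetical data: None marks a missing value. These numbers are made up.
rows = [
    {"outcome": 3.1, "attend": 1, "income": 42.0},
    {"outcome": 2.4, "attend": 0, "income": None},
    {"outcome": 5.0, "attend": 1, "income": 55.5},
    {"outcome": 4.2, "attend": 0, "income": None},
    {"outcome": 3.8, "attend": 1, "income": 61.0},
]

def complete_cases(rows, variables):
    """Keep only rows with no missing value on any listed variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

# The smaller model uses every row; adding income silently drops two cases.
sample_without_income = complete_cases(rows, ["outcome", "attend"])
sample_with_income = complete_cases(rows, ["outcome", "attend", "income"])

print(len(sample_without_income))  # 5
print(len(sample_with_income))     # 3

# To compare models fairly, fit both on the rows that are complete
# for the largest model's variables.
common_sample = sample_with_income
```

Fitting every model on `common_sample` means any change in a coefficient reflects the added predictor rather than a change in who is in the sample.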

Let’s look at how to investigate the effect of the missing data on the regression models in Stata.

The coefficient for the variable “frequent religious attendance” was negative 58 in model 3 and then rose to a positive 6 in model 4 when income was included. Results (more…)

Whenever I get email questions whose answers I think would benefit others, I like to answer them here. I leave out the asker’s name for privacy, but this is a great question about dummy coding:

First of all, thanks for all those helpful information you provided! Thanks sincerely for all your efforts!

Actually I am here to ask a technical question. See, I have 6 locations (let’s say A, B, C, D, E, and F), and I want to see the location effect on the outcome using OLS models.

I know that if I included 5 dummy location variables (6 locations in total, with A as the reference group) in 1 block of the regression analysis, the result would be based on the comparison with the reference location.

Then what if I put 6 dummies (for example, the 1st dummy would be “1” for A location, and “0” for otherwise) in 1 block? Will it be a bug? If not, how to interpret the result?

Thanks a lot!

Great question!

If you put in a 6th dummy code for Location A, your reference group, the model will actually blow up. (Yes, that’s a technical term.)

This is one of those cases of perfect multicollinearity: the six dummy variables always add up to 1, which is exactly the intercept column, so the model can’t be estimated uniquely.

It’s the same situation you learned back in algebra where you have one equation and two unknowns. The problem isn’t that it can’t be solved. The problem is that there are an infinite number of equally good solutions.

If an observation falls in Location A, the reference group, we’ve already gotten that information from the other 5 dummy variables. That observation would have a 0 on all of them. So we already know its location is A. We don’t need another dummy variable to tell the model that. It’s redundant information, and so perfectly redundant that the model will choke.
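Here is a small Python sketch of why the 6-dummy model has no unique solution. The coefficient values are hypothetical, chosen only for illustration. It shows that with an intercept plus all six location dummies, two very different sets of coefficients produce identical predictions for every location.

```python
levels = ["A", "B", "C", "D", "E", "F"]

def design_row(location):
    """Intercept followed by one dummy per location (all six, no reference)."""
    return [1.0] + [1.0 if location == lev else 0.0 for lev in levels]

def predict(coefs, location):
    """Predicted value: sum of coefficient times design value."""
    return sum(b * x for b, x in zip(coefs, design_row(location)))

# Hypothetical coefficients: intercept first, then dummies for A through F.
coefs_1 = [10.0, 0.0, 2.0, -1.0, 4.0, 3.0, 5.0]

# Add any constant to the intercept and subtract it from every dummy.
shift = 7.0
coefs_2 = [coefs_1[0] + shift] + [b - shift for b in coefs_1[1:]]

preds_1 = [predict(coefs_1, loc) for loc in levels]
preds_2 = [predict(coefs_2, loc) for loc in levels]

print(preds_1)  # [10.0, 12.0, 9.0, 14.0, 13.0, 15.0]
print(preds_2)  # [10.0, 12.0, 9.0, 14.0, 13.0, 15.0]

# The six dummies always sum to 1, which duplicates the intercept column.
dummy_sums = [sum(design_row(loc)[1:]) for loc in levels]
```

Because `shift` can be any number, there are infinitely many coefficient sets that fit the data equally well. Dropping one dummy (the reference group) pins the solution down.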

Dummy coding is one of the topics I get the most questions about. It can get especially tricky to interpret when the dummy variables are also used in interactions, so I’ve created some resources that really dig in deeply.

Here’s a little quiz:

### True or False?

1. When you add an interaction to a regression model, you can still evaluate the main effects of the terms that make up the interaction, just like in ANOVA.

2. The **intercept** is usually meaningless in a regression model. (more…)

### Multinomial Logistic Regression

The multinomial (a.k.a. polytomous) logistic regression model is a simple extension of the binomial logistic regression model. It is used when the dependent variable has more than two nominal (unordered) categories.

Dummy coding of independent variables is quite common. In multinomial logistic regression the *dependent* variable is dummy coded into multiple 1/0 variables. There is a variable for every category but one, so if there are M categories, there will be M-1 dummy variables. Each category’s dummy variable has a value of 1 for its category and a 0 for all others. One category, the reference category, doesn’t need its own dummy variable, as it is uniquely identified by all the other variables being 0.
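As a rough illustration of that coding scheme, here is a short Python sketch. The outcome values (`bus`, `car`, `train`) and the helper `dummy_code` are hypothetical, and real software does this coding internally. With M = 3 categories, it produces M-1 = 2 dummy variables.

```python
def dummy_code(values, reference):
    """Return the non-reference categories and one 1/0 row per observation."""
    categories = sorted(set(values))
    non_reference = [c for c in categories if c != reference]
    coded = [[1 if v == c else 0 for c in non_reference] for v in values]
    return non_reference, coded

# Hypothetical nominal outcome with three unordered categories.
outcome = ["bus", "car", "train", "car", "bus"]
columns, coded = dummy_code(outcome, reference="bus")

print(columns)  # ['car', 'train']
for value, row in zip(outcome, coded):
    print(value, row)
# bus [0, 0]
# car [1, 0]
# train [0, 1]
# car [1, 0]
# bus [0, 0]
```

The reference category, `bus`, is the only one with a 0 on every dummy, which is how it is identified without a column of its own.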

The multinomial logistic regression then estimates a separate binary logistic regression model for each of those dummy variables. The result is (more…)