Strategies for Choosing the Reference Category in Dummy Coding

Every statistical software procedure that dummy codes predictor variables uses a default for choosing the reference category.

This default is usually the category that comes first or last alphabetically.

That may or may not be the best category to use, but fortunately you’re not stuck with the defaults.

So if you do choose, which one should you choose?

The first thing to remember is that ultimately, it doesn’t really matter, as long as you are aware of which category is the reference. The model fit, the predicted values, and the overall test for the variable are the same no matter which category you choose. Only the specific comparisons that the software reports (and gives you p-values for) will differ.

So it’s best to choose a category that makes interpretation of results easier. Here are a few common options for choosing a category.

Remember, the regression coefficients will give you the difference in means (and/or slopes if you’ve included an interaction term) between each other category and the reference category.
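Here is a minimal Python sketch of that idea, assuming a model with a single categorical predictor and no other terms. In that special case the least-squares intercept is the reference group's mean and each coefficient is a group mean minus the reference mean, so the sketch computes them directly from group means instead of fitting a regression. The data values and the function names are made up for illustration.

```python
from statistics import mean

# Hypothetical toy data: outcome values by group (invented for illustration).
data = {
    "Never Married": [9.0, 10.0, 11.0],
    "Currently Married": [10.0, 11.0, 12.0],
    "Widowed": [18.0, 19.0, 20.0],
}

def dummy_coefficients(groups, reference):
    """Intercept and coefficients of a one-predictor dummy-coded model.

    With a single categorical predictor, the intercept is the reference
    group's mean and each coefficient is that group's mean minus the
    reference mean.
    """
    means = {g: mean(values) for g, values in groups.items()}
    intercept = means[reference]
    coefs = {g: m - intercept for g, m in means.items() if g != reference}
    return intercept, coefs

def fitted_means(intercept, coefs, reference):
    """Predicted mean for every group implied by the coefficients."""
    fitted = {reference: intercept}
    fitted.update({g: intercept + b for g, b in coefs.items()})
    return fitted

# Same data, two different reference categories.
for ref in ("Widowed", "Currently Married"):
    b0, b = dummy_coefficients(data, ref)
    print(ref, "| intercept:", b0, "| coefficients:", b)
    print("  fitted means:", fitted_means(b0, b, ref))
```

The reported coefficients change with the reference, but the fitted group means do not, which is why the choice is about interpretation and not about the model itself.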

Strategy 1: Use the normative category

In many cases, the most logical or important comparisons are to the most normative group. For example, in one data set I analyzed, an important dummy-coded predictor is Poverty Status: In Poverty or Not In Poverty.

Not In Poverty is the norm–most people aren’t in Poverty (at least in this data set–it may not be true in the population you’re studying). The interesting comparison is to see how people in poverty differ from this normative group. So making Not In Poverty the reference group just makes sense.

Likewise, another example is Marital Status: Never Married, Currently Married, Divorced, Separated, or Widowed.

The alphabetical default would make Widowed the reference group. But it’s not as interesting to compare Separated people to Widowed people, as they’re both small groups in the data set, and the most interesting comparisons are with the normative categories of Never Married or Currently Married.

In experiments or randomized controlled trials, the control group is a natural normative category. The only exception I can think of is a study with multiple controls, but only one intervention or treatment group. In that case, it may be more important to measure any differences between the treatment and each control.

Strategy 2: Use the largest category

The other problem with using the Widowed group as the reference is that it’s very, very small. When sample sizes are very unequal across groups, which is very common for naturally occurring groups, using one of the smallest groups as the reference can become problematic: every reported comparison involves that small group, so every comparison has low power.

Sometimes, if there isn’t a normative group in a logical sense, it makes sense to just use the largest category as the reference.
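To see why group size matters, here is a rough Python sketch. It assumes a model with one categorical predictor and a common residual standard deviation, in which case the standard error of each coefficient is sd * sqrt(1/n_group + 1/n_reference). The group sizes and the residual SD below are hypothetical, not taken from any real data set.

```python
from math import sqrt

# Hypothetical group sizes and residual SD, chosen only for illustration.
sizes = {
    "Currently Married": 500,
    "Never Married": 300,
    "Divorced": 120,
    "Separated": 40,
    "Widowed": 12,
}
resid_sd = 5.0

def se_vs_reference(sizes, reference, sd):
    """Standard error of each group-minus-reference difference in means."""
    n_ref = sizes[reference]
    return {
        g: sd * sqrt(1 / n + 1 / n_ref)
        for g, n in sizes.items()
        if g != reference
    }

se_small_ref = se_vs_reference(sizes, "Widowed", resid_sd)
se_large_ref = se_vs_reference(sizes, "Currently Married", resid_sd)

for g in ("Never Married", "Divorced", "Separated"):
    print(f"{g}: SE vs Widowed = {se_small_ref[g]:.2f}, "
          f"SE vs Currently Married = {se_large_ref[g]:.2f}")
```

Under these assumptions, every standard error with the small Widowed group as the reference is at least sd * sqrt(1/12), no matter how large the other group is, so every reported comparison is less precise.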

Strategy 3: Use the category whose mean is in the middle, or conversely, at one of the ends

Sometimes all of these options fail. There is no obvious norm and sample sizes are similar.

In those cases, sometimes the best thing to do is to pick the category with the lowest, the highest, or the middle mean. Let me give you an example.

Let’s say those 5 marital categories have means on Y of

10 Never Married

11 Currently Married

9 Divorced

15 Separated

19 Widowed

If the overall F test in the ANOVA table is significant for this variable, you can generally expect the highest and lowest means to be significantly different. You just don’t know which of the middle three are significantly different from each of those.

For example, the middle value here is 11, the mean for Currently Married folks. If you use that as the reference group and discover that it is significantly lower than both 15, the mean for Separated folks, and 19, the mean for Widowed folks, you know that 9 for Divorced and 10 for Never Married should be significantly lower than those two means as well. (Note: this doesn’t always hold if some groups have much smaller sample sizes, but as long as they’re reasonably equal, it should hold.)

You won’t know, for example, if there is a significant difference between the means for the Separated and Widowed groups, but if that’s not a theoretically important comparison, you’re done.

This particular strategy doesn’t always work, but you can use it to your advantage when it does.
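As a small illustration of Strategy 3, the Python sketch below uses the five example means above, picks the category with the middle mean as the reference, and lists which pairwise comparisons the dummy-coded output would and would not report. The variable names are just for this example, and it only does the arithmetic on the means; it says nothing about significance.

```python
from itertools import combinations

# Group means on Y from the example above.
means = {
    "Never Married": 10,
    "Currently Married": 11,
    "Divorced": 9,
    "Separated": 15,
    "Widowed": 19,
}

# Pick the category whose mean is in the middle as the reference.
ordered = sorted(means, key=means.get)
reference = ordered[len(ordered) // 2]

# Each dummy coefficient is that group's mean minus the reference mean.
coefs = {g: m - means[reference] for g, m in means.items() if g != reference}

# Comparisons the output reports directly versus those it does not.
reported = {frozenset((g, reference)) for g in coefs}
not_reported = [
    pair for pair in combinations(means, 2)
    if frozenset(pair) not in reported
]

print("Reference:", reference)
print("Coefficients:", coefs)
print("Not reported directly:", not_reported)
```

With five categories there are ten possible pairwise comparisons, and any single reference category reports only four of them. Separated versus Widowed is one of the six left out here.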


Comments

Garima Jain says

March 15, 2024 at 8:04 am

Thanks for the post. Very helpful. I have a situation where two of my variables of interest both have reference categories and the significance drops completely if I change one of the reference categories to another.

I am studying migration (Y= migrant/non-migrant) dependent on 4 categories of flooding (high intensity high frequency, high frequency low intensity, high intensity low frequency, low frequency low intensity) and whether the person is minor/adult male/female. I am also interacting the two independent variables (flooding and genderAge). When I set the flooding reference category as HH, all movements and coeff are significant, whereas when I set the reference as LL, many of the significance levels get wiped out. How do I interpret these results then?

sam Wafula says

December 31, 2023 at 2:42 pm

What guides me in selecting the reference category is by and large the underlying theory. For instance, if one of the independent variables is age in the study on contraceptive preferences, then theoretically, it is expected that older women say 40+ years are likely to have achieved their desired fertility and therefore could prefer long acting and permanent methods of family planning as compared to the rest. Conversely, the youngest women (say 1 making our interpretation to be easy.

Afreen says

April 15, 2023 at 8:01 am

I am confused about the reference category. Why would we change it, and is changing it beneficial or not?

- Karen Grace-Martin says
  
  May 5, 2023 at 1:58 pm
  
  You don’t have to change it. Depending on your research question, sometimes changing it gives you a comparison that most closely answers the question.
  
Afreen says

April 15, 2023 at 7:36 am

I have an education variable with six categories (no education, incomplete primary education, complete primary education, incomplete secondary education, complete secondary education, and higher education). When I run a Poisson regression in Stata, it automatically chooses no education as the reference category. I want to check whether higher education decreases fertility. Would it be a good idea to choose higher education as the reference category?

- Karen Grace-Martin says
  
  May 5, 2023 at 2:01 pm
  
  If you want to see if higher education has a lower mean fertility than the others, then yes, that would be a good idea.
  
- sam Wafula says
  
  December 31, 2023 at 2:56 pm
  
  You can use the “char” command to select your desired category. For example, assuming the variable is named Educ and the categories are coded 1, 2, 3, 4, 5, and 6 respectively, you can override Stata’s automatic choice of the first category (1) with the following command:
  
  char Educ[omit] 3
  
  In this case, we have chosen 3 (complete primary education) to be our reference category.
  
  I hope this helps
  
Stijn says

April 13, 2023 at 10:09 am

Hi, I am observing data from June to March, and I want to see the effect of being in a specific month, so I added dummies for the months (1, 0). As far as I know, I should leave one month out of the regression, which then becomes the base level? However, if I do this and, for example, pick November as the base (leave it out), the results are highly significant. If I leave June out, the coefficients are the same but the results are insignificant for some variables. How is this possible?

- Karen Grace-Martin says
  
  May 5, 2023 at 2:06 pm
  
  I’m not sure if you mean the coefficients that have higher p-values for June are those for the months themselves or for other variables in the model. I’m assuming the latter. It could be a few different things. If June has a smaller sample size, using it as the base category can leave you with smaller power. Or, if the mean for June is right in the middle of the means, but November is the highest or lowest, there could be months whose means are different from Nov but not June.
  
Farah Mneimneh says

March 30, 2023 at 8:11 pm

The coefficients of the control/dummy variables are not changing when I change the reference level of the independent variable. Is this normal? You mentioned that the coefficients might be affected if the reference level is of small counts compared to other levels.

- Karen Grace-Martin says
  
  April 11, 2023 at 10:30 am
  
  I’m not sure if you mean the coefficient of the variable you’re changing or if you mean coefficients of other variables. If the one you’re changing reference levels for has only two values, its coefficient should change sign. Other coefficients shouldn’t change as long as there are no interactions.
  
GIRMA HUKA says

August 6, 2022 at 10:48 am

THANKS FOR CLEAR EXPLANATION!

- GIRMA HUKA says
  
  August 6, 2022 at 10:50 am
  
  THANKS FOR CLEAR EXPLANATION.GIRMA HUKA DUKALE WEST GUJI ZONE,OROMIA,ETHIOPIA
  
Reema says

February 5, 2021 at 2:43 pm

I am a beginner in data science and have a general question about the reference category.
Are reference categories always assumed to be significant by default? When giving business recommendations, we compare the remaining categories with the reference, e.g., item “x” is more popular than the “reference category item”, and hence the client should consider producing more “x” items than the “reference item”.

- Karen Grace-Martin says
  
  December 2, 2021 at 9:56 am
  
  Hi Reema,
  It all depends on what you’re trying to test. If you’re trying to understand, say, whether customers like item A more than the reference item B, then yes, your test is about the difference in the mean of liking between those two items. The coefficient you care about then is item A’s. That’s the difference between A and B.
  
abdulaziz says

September 26, 2020 at 11:33 pm

how can I select reference category in stata 9

Leonardo Castilho says

January 28, 2019 at 4:18 pm

Why is using small sample groups as the reference problematic?

- Karen Grace-Martin says
  
  March 4, 2019 at 11:31 am
  
  Hi Leonardo,
  
  It’s generally a lack of power. Your power is determined by your smaller group.
  
Rousset says

April 21, 2018 at 3:51 am

How do I choose the reference category in Stata, so that it is not arbitrarily the last alphabetical one?
Presently, I am doing an xtreg in Stata and the omitted category is the last one. I would like to choose another one so that the results are easier to interpret.
Thanks

- Azadeh says
  
  December 1, 2021 at 6:13 pm
  
  use ib#.[variable_name]. b stands for base and # is the number for that category in your variable.
  
Shalaw says

June 17, 2016 at 11:45 am

I am going to analyze a situation where there are 300 non-injury cases and only 17 injury cases. Four categorical variables are significant according to the chi-square test, so I then used multiple logistic regression with the significant variables. Three of them are significant again. Does this make sense? I would like to know whether I can use multiple logistic regression when only 17 of the 317 respondents were injured. I used SPSS for the data analysis.
