In all linear regression models, the intercept has the same definition: the mean of the response, Y, when all predictors (all X) equal 0.

But “when all X=0” has different implications, depending on the scale on which each X is measured and on which terms are included in the model.

So let’s specifically discuss the meaning of the intercept in some common models:

### Example 1: Both X_{1} and X_{2} are Numerical and Uncentered

This is the model you learn the most about in regression classes.

In this model, the intercept is not always meaningful.

Since the intercept is the mean of Y when all predictors equal zero, that mean is only useful if every X in the model actually *has* some values of zero.

If they do, no problem.

But if one predictor is a variable like age of employees in a company, there should be no values even close to zero. So while the intercept will be necessary for calculating predicted values, it has no real meaning.

And what’s more, in this type of model, it’s rare to have any hypotheses about the intercept, so you may have been taught to ignore it. That works here, but as you’ll see, the intercept can become a meaningful comparison point in other types of models.

### Example 2: Both X_{1} and X_{2} are Numerical and Centered At Their Mean

But if we center each predictor variable at its own mean, we rescale the mean to zero. So each X equals zero at its own mean.

So the intercept is simply the mean of Y at the mean value of each of the predictor variables.

You still may not have hypotheses about it, but it at least is a meaningful value.
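Here is a minimal sketch of what centering does, using simulated data. The variable names (`age`, `tenure`), the coefficients, and the helper `fit_ols` are all made up for illustration. It fits the same model with uncentered and centered predictors and compares the two intercepts.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed; the data are simulated
n = 200
age = rng.uniform(25, 65, size=n)    # hypothetical employee ages, never near 0
tenure = rng.uniform(1, 30, size=n)  # hypothetical years of service
y = 20 + 0.5 * age + 1.2 * tenure + rng.normal(0, 3, size=n)

def fit_ols(y, *predictors):
    """Return [intercept, slope_1, slope_2, ...] from ordinary least squares."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

raw = fit_ols(y, age, tenure)
centered = fit_ols(y, age - age.mean(), tenure - tenure.mean())

print("uncentered intercept:", raw[0])       # predicted Y at age 0, tenure 0
print("centered intercept:  ", centered[0])  # predicted Y at mean age, mean tenure
print("mean of Y:           ", y.mean())
print("slopes unchanged:    ", np.allclose(raw[1:], centered[1:]))
```

In a least squares fit with no multiplicative terms, the centered intercept matches the sample mean of Y, while the uncentered intercept is an extrapolation to an employee of age zero with zero tenure.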

### Example 3: Both X_{1} and X_{2} are Categorical and Dummy Coded

Dummy coded predictor variables have only two possible values: 0 and 1. Zero always refers to the reference group for each dummy coded predictor.

Hopefully it’s clear that in this model the intercept will be the mean of Y for observations that are in the reference group of both predictors (assuming there is no interaction between the two).

So this is an example where the intercept becomes meaningful and useful for answering hypotheses. The other coefficients in the model will be differences between this mean and the means for the comparison groups.
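As a rough illustration, the sketch below builds a tiny hypothetical 2x2 data set whose cell means are exactly additive (no interaction), then fits a dummy coded model. The cell means and the names `d1` and `d2` are invented for the example.

```python
import numpy as np

# Hypothetical 2x2 design: cell means are exactly additive (no interaction).
cell_means = {(0, 0): 10.0, (1, 0): 13.0, (0, 1): 15.0, (1, 1): 18.0}
deviations = np.array([-1.0, 0.0, 1.0])  # within-cell scatter that sums to zero

d1, d2, y = [], [], []
for (a, b), mean in cell_means.items():
    for dev in deviations:
        d1.append(a)
        d2.append(b)
        y.append(mean + dev)
d1, d2, y = np.array(d1), np.array(d2), np.array(y)

# Design matrix: intercept column plus the two dummy variables.
X = np.column_stack([np.ones(len(y)), d1, d2])
intercept, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

reference_mean = y[(d1 == 0) & (d2 == 0)].mean()
print(intercept, reference_mean)  # both 10, up to floating-point rounding
print(b1, b2)                     # 3 and 5, up to floating-point rounding
```

The match between the intercept and the observed reference-cell mean relies on the additivity built into these made-up data. If the cell means included an interaction and the model did not, the intercept would be a predicted value rather than an observed group mean.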

### Example 4: X_{1} is Numerical and Centered and X_{2} is Categorical and Dummy Coded

Here we just combine exactly what we’ve been doing in the other examples.

The intercept would be the mean of Y at the mean of X_{1} for only the reference group of X_{2}.

This would be a very useful value to have, especially if X_{1} is a covariate and X_{2} an independent variable. The coefficient for X_{2} is the difference between this reference group mean (the intercept) and the comparison group mean, evaluated at the mean of the covariate.

### Example 5: Both X_{1} and X_{2} are Categorical and Effect Coded

Effect coding is a different way of assigning numerical values to categories so that they work in a linear model. It is the coding scheme that ANOVA uses.

While it is not usually the default in regression, it can be very useful.

The way effect coding works is to assign values of -1 and 1 to the categories. What this does is place zero between the two categories.

As long as the data are balanced across the two categories, the mean of Y when all X equal zero will be the overall grand mean of Y.

The reason this works is that even though there are no data where X_{1} and X_{2} equal zero, zero is right in the middle of the two categories. Since, with balanced data, the mean of the two group means is equal to the mean of all the points, this value in the middle ends up being the overall grand mean.
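To see this numerically, here is a small sketch with simulated, balanced data. The effect codes `e1` and `e2`, the cell size of 25, and the coefficients are assumptions chosen only for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed; the data are simulated
# Balanced 2x2 design with 25 observations per cell; e1 and e2 are effect codes.
e1 = np.repeat([-1.0, -1.0, 1.0, 1.0], 25)
e2 = np.repeat([-1.0, 1.0, -1.0, 1.0], 25)
y = 50 + 4 * e1 - 2 * e2 + rng.normal(0, 5, size=e1.size)

X = np.column_stack([np.ones(e1.size), e1, e2])
intercept, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

# Average of the four observed cell means.
cell_avgs = [y[(e1 == a) & (e2 == b)].mean() for a in (-1, 1) for b in (-1, 1)]

print("intercept:         ", intercept)
print("grand mean of Y:   ", y.mean())
print("mean of cell means:", np.mean(cell_avgs))
```

With equal cell sizes the three printed values agree. If you drop some rows from one cell to unbalance the design, the grand mean and the intercept will generally drift apart.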

### Example 6: X_{1} is Numerical and Centered and X_{2} is Categorical and Effect Coded

You probably know where I’m going with this one. Centering for a numerical variable does the same basic thing as effect coding for a categorical variable – it puts zero in the middle.

In this model, the intercept is the mean of Y at the mean of X_{1} across both groups of X_{2}.

If you compare this to Example 4, you see the intercept has a different meaning, even though both examples include one numerical and one categorical predictor. In that model, we were evaluating the mean of Y for only one group of X_{2} because X_{2} was dummy coded.

Here, the effect coding means we’re averaging across both groups.
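The contrast between Examples 4 and 6 can be sketched by fitting both codings to the same simulated data. Everything here (the covariate `x1`, the two-level `group`, the coefficients, the helper `fit_ols`) is hypothetical, and the groups are balanced by construction.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed; the data are simulated
n = 120
x1 = rng.normal(40, 8, size=n)       # hypothetical numerical covariate
group = np.tile([0.0, 1.0], n // 2)  # balanced two-level factor
y = 5 + 0.3 * x1 + 6 * group + rng.normal(0, 2, size=n)

x1c = x1 - x1.mean()    # centered at the overall mean
dummy = group           # 0 = reference group, 1 = comparison group
effect = 2 * group - 1  # -1 = reference group, 1 = comparison group

def fit_ols(y, *predictors):
    """Return [intercept, coef_1, coef_2, ...] from ordinary least squares."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    return np.linalg.lstsq(X, y, rcond=None)[0]

a_dummy, slope_dummy, diff = fit_ols(y, x1c, dummy)
a_effect, slope_effect, half_diff = fit_ols(y, x1c, effect)

print("dummy intercept (reference group at mean X1):", a_dummy)
print("effect intercept (across groups at mean X1): ", a_effect)
print("dummy intercept + half the group difference: ", a_dummy + diff / 2)
print("same X1 slope in both fits:", np.isclose(slope_dummy, slope_effect))
```

The effect coded intercept sits halfway between the two groups' predicted means at the mean of X_{1}, which is why it equals the dummy coded intercept plus half of the dummy coefficient.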

Both approaches are helpful and meaningful in different situations. You just need to choose which information helps you understand your data and how they apply to your research questions.

### Beyond The Six Examples

In all of these models, because we did not include any multiplicative terms like interactions or polynomials, all of the coding and scaling changes affected the intercepts, but not the model slopes.

They affect the intercepts in exactly the same way in models with multiplicative terms, but they also affect some of the slope coefficients.

So all of the intercept interpretations I’ve outlined above are exactly the same whether multiplicative terms are in the model or not.
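One way to try this out is sketched below with simulated data and an interaction term. The predictors, coefficients, and the helper `fit_with_interaction` are invented for illustration. The sketch fits the model with uncentered and centered predictors and compares the coefficients.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed; the data are simulated
n = 300
x1 = rng.uniform(20, 60, size=n)
x2 = rng.uniform(0, 10, size=n)
y = 2 + 0.5 * x1 + 1.5 * x2 + 0.1 * x1 * x2 + rng.normal(0, 1, size=n)

def fit_with_interaction(y, a, b):
    """Return [intercept, coef_a, coef_b, coef_a_times_b] from least squares."""
    X = np.column_stack([np.ones(len(y)), a, b, a * b])
    return np.linalg.lstsq(X, y, rcond=None)[0]

raw = fit_with_interaction(y, x1, x2)
cen = fit_with_interaction(y, x1 - x1.mean(), x2 - x2.mean())

# Prediction from the uncentered fit, evaluated at the mean of each predictor.
m1, m2 = x1.mean(), x2.mean()
raw_at_means = raw[0] + raw[1] * m1 + raw[2] * m2 + raw[3] * m1 * m2

print("interaction coefficient (raw, centered):", raw[3], cen[3])
print("X1 coefficient (raw, centered):         ", raw[1], cen[1])
print("raw X1 slope evaluated at mean X2:      ", raw[1] + raw[3] * m2)
print("centered intercept vs raw fit at means: ", cen[0], raw_at_means)
```

The interaction coefficient is the same in both fits, but the X_{1} coefficient changes: in the centered model it is the slope of X_{1} at the mean of X_{2} rather than at X_{2} = 0. The centered intercept is still the predicted Y where both centered predictors equal zero, that is, at their means.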

I suggest you take a simple data set, try some different coding schemes, and try it out for yourself. That’s often the best way to cement your understanding.

### Comments

Example 3 should mention that the model needs to include an interaction effect. Otherwise the intercept will not be equal to the group mean of a group that is described by having both 0 on both predictor variables. Instead it will be a predicted mean of a group coded all 0’s under an assumption of additivity. As soon as this is violated, the intercept will not be identical to an observed group mean anymore.

True, true. Good point. This does assume that any interaction effect really = 0.