In all linear regression models, the intercept has the same definition: the mean of the response, Y, when all predictors (all X) equal 0.
But “when all X=0” has different implications, depending on the scale on which each X is measured and on which terms are included in the model.
So let’s specifically discuss the meaning of the intercept in some common models, each of which has two predictor variables, X1 and X2. The interpretations easily expand for models with more predictors of each type.
Example 1: Both X1 and X2 are Numerical and Uncentered
This is the model you learn the most about in regression classes.
In this model, the intercept is not always meaningful.
Since the intercept is the mean of Y when all predictors equal zero, it is only useful if every X in the model actually has some values of zero.
If they do, no problem.
But if one predictor is a variable like age of employees in a company, there should be no values even close to zero. So while the intercept will be necessary for calculating predicted values, it has no real meaning.
And what’s more, in this type of model, it’s rare to have any hypotheses about the intercept, so you may have been taught to ignore it. That works here, but as you’ll see, the intercept can become a meaningful comparison point in other types of models.
Example 2: Both X1 and X2 are Numerical and Centered At Their Mean
But if we center each predictor variable at its own mean, we rescale its mean to zero. So every X equals zero at its own mean.
So the intercept is simply the mean of Y at the mean value of each of the predictor variables.
You still may not have hypotheses about it, but it at least is a meaningful value.
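If you'd like to check this yourself, here is a minimal sketch in Python with NumPy. The data are simulated, and the names `x1`, `x2`, and `fit_ols` are made up for illustration. It fits the same model twice, once with raw predictors and once with mean-centered ones, and compares the intercepts.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 200
x1 = rng.normal(40, 10, n)  # hypothetical predictor, e.g. age in years
x2 = rng.normal(5, 2, n)    # hypothetical second predictor
y = 3.0 + 0.5 * x1 - 1.2 * x2 + rng.normal(0, 1, n)

def fit_ols(y, *predictors):
    """Return ordinary least squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

raw = fit_ols(y, x1, x2)                                # Example 1: uncentered
centered = fit_ols(y, x1 - x1.mean(), x2 - x2.mean())   # Example 2: centered

print("uncentered intercept:", raw[0])
print("centered intercept:  ", centered[0])
print("mean of y:           ", y.mean())
print("slopes unchanged:    ", np.allclose(raw[1:], centered[1:]))
```

The uncentered intercept is an extrapolation to X1 = 0 and X2 = 0, far from any observed data. The centered intercept matches the sample mean of Y, and the slopes are the same in both fits.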
Example 3: Both X1 and X2 are Categorical and Dummy Coded
Dummy coded predictor variables have only two possible values: 0 and 1. Zero always refers to the reference group for each dummy coded predictor.
Hopefully it’s clear that in this model the intercept will be the mean of Y for cases that are in the reference group of both predictors.
So this is an example where the intercept becomes meaningful and useful for answering hypotheses. The other coefficients in the model will be differences between this mean and the means for the comparison groups.
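Here is a small sketch of the dummy coded case in Python with NumPy. The data are simulated, and I've built them so the four cell means are exactly additive (no interaction between the two predictors), which is an assumption of this illustration. The names `d1` and `d2` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
per_cell = 25

# Dummy codes: 0 = reference group, 1 = comparison group (balanced 2x2 layout)
d1 = np.repeat([0, 0, 1, 1], per_cell)
d2 = np.repeat([0, 1, 0, 1], per_cell)

# Noise is centered within each cell so the observed cell means are exactly
# additive: 10 in the reference cell, +4 for d1, -3 for d2 (made-up values).
noise = rng.normal(0, 2, (4, per_cell))
noise -= noise.mean(axis=1, keepdims=True)
y = 10 + 4 * d1 - 3 * d2 + noise.ravel()

X = np.column_stack([np.ones_like(y), d1, d2])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

ref_mean = y[(d1 == 0) & (d2 == 0)].mean()  # observed mean, both reference groups
print("intercept:          ", b0)
print("reference cell mean:", ref_mean)
print("d1 coefficient:     ", b1)
print("d2 coefficient:     ", b2)
```

With additive cell means, the intercept equals the observed mean of the cell where both dummies are 0. If the data contained a real interaction and the model left it out, the intercept would be a predicted mean under additivity rather than that observed cell mean.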
Example 4: X1 is Numerical and Centered and X2 is Categorical and Dummy Coded
Here we just combine exactly what we’ve been doing in the other examples.
The intercept would be the mean of Y at the mean of X1 for only the reference group of X2.
This would be a very useful value to have, especially if X1 is a covariate and X2 an independent variable. The coefficient for X2 is the difference between this reference group mean (the intercept) and the comparison group mean, evaluated at the mean of the covariate.
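A quick way to convince yourself of this is to compare the centered model's intercept with a prediction from the uncentered model. This is a sketch on simulated data in Python with NumPy; `x1`, `group`, and `fit_ols` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 120
x1 = rng.normal(50, 8, n)       # hypothetical numerical covariate
group = rng.integers(0, 2, n)   # dummy coded: 0 = reference, 1 = comparison
y = 20 + 0.3 * x1 + 5 * group + rng.normal(0, 1, n)

def fit_ols(y, *predictors):
    """Return ordinary least squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

raw = fit_ols(y, x1, group)
centered = fit_ols(y, x1 - x1.mean(), group)

# Uncentered model's prediction for the reference group at the mean of x1
pred_ref_at_mean = raw[0] + raw[1] * x1.mean()

print("centered intercept:            ", centered[0])
print("reference group at mean of x1: ", pred_ref_at_mean)
print("group coefficient (both fits): ", raw[2], centered[2])
```

The centered intercept is the model's mean of Y for the reference group at the mean of X1, and the group coefficient is the same in both fits.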
Example 5: Both X1 and X2 are Categorical and Effect Coded
Effect coding is a different way of assigning numerical values to categories so that they work in a linear model. It is the coding scheme that ANOVA uses.
While it is not usually the default in regression, it can be very useful.
The way effect coding works is to assign values of -1 and 1 to the categories. What this does is place zero between the two categories.
As long as the data are balanced across the two categories, the mean of Y when all X equal zero will be the overall grand mean of Y.
This works because, even though there are no data where X1 and X2 equal zero, zero sits right in the middle of the two categories. Since, with balanced data, the mean of the group means is equal to the mean of all the points, this value in the middle ends up being the overall grand mean.
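Here is a minimal sketch of the effect coded case in Python with NumPy, using simulated, balanced data (equal numbers in each of the four cells). The names `e1` and `e2` and the numbers in the simulation are made up.

```python
import numpy as np

rng = np.random.default_rng(3)
per_cell = 30

# Effect codes: -1 and 1, balanced across a 2x2 layout
e1 = np.repeat([-1, -1, 1, 1], per_cell)
e2 = np.repeat([-1, 1, -1, 1], per_cell)
y = 50 + 4 * e1 - 2 * e2 + rng.normal(0, 3, 4 * per_cell)

X = np.column_stack([np.ones_like(y), e1, e2])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

# Observed mean of each of the four cells
cell_means = [y[(e1 == a) & (e2 == b)].mean() for a in (-1, 1) for b in (-1, 1)]

print("intercept:          ", b0)
print("grand mean of y:    ", y.mean())
print("mean of cell means: ", np.mean(cell_means))
```

All three printed values agree here because the cells are the same size. With unbalanced data, the grand mean of all points and the mean of the cell means are generally no longer equal.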
Example 6: X1 is Numerical and Centered and X2 is Categorical and Effect Coded
You probably know where I’m going with this one. Centering for a numerical variable does the same basic thing as effect coding for a categorical variable – it puts zero in the middle.
In this model, the intercept is the mean of Y at the mean of X1 across both groups of X2.
If you compare this to Example 4, you see the intercept has a different meaning, even though both examples include one numerical and one categorical predictor. In that model, we were evaluating the mean of Y for only one group of X2 because X2 was dummy coded.
Here, the effect coding means we’re averaging across both groups.
Both approaches are helpful and meaningful in different situations. You just need to choose which information helps you understand your data and how they apply to your research questions.
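To see the two codings side by side, here is a sketch in Python with NumPy that fits the same simulated data with X2 dummy coded and then effect coded. It assumes equal group sizes, and `x1c`, `dummy`, `effect`, and `fit_ols` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x1 = rng.normal(30, 5, n)          # hypothetical numerical predictor
dummy = np.tile([0, 1], n // 2)    # X2 dummy coded, equal group sizes
effect = 2 * dummy - 1             # the same X2, effect coded as -1 / 1
y = 12 + 0.8 * x1 + 6 * dummy + rng.normal(0, 2, n)

x1c = x1 - x1.mean()               # center the numerical predictor

def fit_ols(y, *predictors):
    """Return ordinary least squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

dum = fit_ols(y, x1c, dummy)    # Example 4 style
eff = fit_ols(y, x1c, effect)   # Example 6 style

print("dummy coded intercept: ", dum[0])   # reference group at mean of x1
print("effect coded intercept:", eff[0])   # averaged across both groups
print("mean of y:             ", y.mean())
```

The effect coded intercept equals the overall mean of Y, while the dummy coded intercept sits lower by the effect coded coefficient for X2, because it describes only the reference group.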
Beyond The Six Examples
In all of these models, because we did not include any multiplicative terms like interactions or polynomials, all of the coding and scaling changes affected the intercepts, but not the model slopes.
They affect the intercepts in exactly the same way in models with multiplicative terms, but they also affect some of the slope coefficients.
So all of the intercept interpretations I’ve outlined above hold whether multiplicative terms are in the model or not.
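The sketch below, in Python with NumPy on simulated data, adds an interaction term and centers both predictors. The names `x1`, `x2`, and `fit_ols` and the simulated coefficients are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
x1 = rng.normal(10, 2, n)   # hypothetical numerical predictors
x2 = rng.normal(4, 1, n)
y = 1 + 2 * x1 + 3 * x2 + 0.5 * x1 * x2 + rng.normal(0, 1, n)

def fit_ols(y, *predictors):
    """Return ordinary least squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

m1, m2 = x1.mean(), x2.mean()
u, v = x1 - m1, x2 - m2   # centered versions

raw = fit_ols(y, x1, x2, x1 * x2)
cen = fit_ols(y, u, v, u * v)

# Intercept of the centered model: predicted Y at the mean of x1 and x2
pred_at_means = raw[0] + raw[1] * m1 + raw[2] * m2 + raw[3] * m1 * m2

print("centered intercept:    ", cen[0], "vs prediction at means:", pred_at_means)
print("x1 slope, raw/centered:", raw[1], cen[1])   # these differ
print("interaction, raw/cen.: ", raw[3], cen[3])   # these match
```

The intercept still means "predicted Y when all X = 0", which after centering is the prediction at the means. The slope for X1 changes, though, because with an interaction in the model it is the slope of X1 when X2 = 0, and centering moves where zero is.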
I suggest you take a simple data set, try some different coding schemes, and try it out for yourself. That’s often the best way to cement your understanding.
PF Duralwes says
Hi Karen, thanks for the post. My question relates specifically to Example 4 (2 dummy coded categorical X variables) though I am interested in the answer for the other examples as well. You are writing here about how to interpret the coefficients. What about the p-values? It seems to me that while the coefficient of a variable indicates the difference between it and the reference (intercept), the p-value indicates something else… the difference between it and zero? Thanks!
DAGNACHEW GETNET says
How can I correct marginal predicted values that become negative or greater than 1 in a multivariate probit?
Example 3 should mention that the model needs to include an interaction effect. Otherwise the intercept will not be equal to the group mean of a group that is described by having both 0 on both predictor variables. Instead it will be a predicted mean of a group coded all 0’s under an assumption of additivity. As soon as this is violated, the intercept will not be identical to an observed group mean anymore.
True, true. Good point. This does assume that any interaction effect really = 0.