I recently received a great question in a comment about whether the assumptions of normality, constant variance, and independence in linear models are about the residuals or the response variable.
The asker had a situation where Y, the response, was not normally distributed, but the residuals were.
Quick Answer: It’s just the residuals.
In fact, if you look at any (good) statistics textbook on linear models, you’ll see a line just below the model that states the assumptions:
ε~ i.i.d. N(0, σ²)
That ε is the residual term (and it ought to have an i subscript–one for each individual). The i.i.d. means every residual is independent and identically distributed. They all have the same distribution, which is defined right afterward.
You’ll notice there is nothing similar about Y. ε’s distribution is influenced by Y’s, which is why Y has to be continuous, unbounded, and measured on an interval or ratio scale.
But Y’s distribution is also influenced by the X’s. ε’s isn’t. That’s why you can get a normal distribution for ε, but lopsided, chunky, or just plain weird-looking Y.
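To see how this can happen, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration (the exponential X, the coefficients 2 and 3, the sample size, the seed); none of it comes from the asker's data. It builds a Y from a strongly skewed X plus normal errors, fits a simple regression by hand, and compares the skewness of Y with the skewness of the residuals.

```python
import random
import statistics

random.seed(42)  # arbitrary seed, only so the run is reproducible
n = 5000         # hypothetical sample size

# Hypothetical data: a strongly right-skewed predictor and normal errors.
x = [random.expovariate(1.0) for _ in range(n)]
eps = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [2.0 + 3.0 * xi + ei for xi, ei in zip(x, eps)]

# Ordinary least squares with one predictor: slope = Sxy / Sxx.
mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def skewness(values):
    """Sample skewness; close to 0 for a symmetric distribution such as the normal."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - m) / sd) ** 3 for v in values) / len(values)


skew_y = skewness(y)
skew_resid = skewness(residuals)
print(f"skewness of Y:         {skew_y:.2f}")
print(f"skewness of residuals: {skew_resid:.2f}")
```

In a run like this, Y should come out clearly right-skewed because it inherits the shape of X, while the residuals should have skewness near zero. Skewness is only one rough symptom of non-normality; a histogram or Q-Q plot of the residuals would be the more usual check.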
________________________________________________________
If you want to learn in detail what the assumptions really mean, how to check them, and what to do if they’re not met, check out our Assumptions of Linear Models Workshop.




2 comments
Hi Karen,
Since Y = E(Y) + ε, and E(Y) is a constant (a function of the X’s and betas), this should imply that the variance, independence, and distributional assumptions on ε apply to Y as well. Am I right to say this?
Hi Kevin,
One small change that makes all the difference: Y = E(Y|X) + e. If every individual had the same value of X, then yes, the distribution of Y would match that of e. Since they generally differ, the Y’s are affected by the X’s but the residuals aren’t.
The distribution of Y|X is the same as the distribution of e, but the distribution of Y isn’t necessarily. I’ve seen many data sets where Y is skewed, but e is normal.
Karen