While there are a number of distributional assumptions in regression models, one distribution that has no assumptions is that of any predictor (i.e. independent) variables.
That is because regression models are directional. In a correlation, there is no direction: Y and X are interchangeable. If you switched them, you’d get the same correlation coefficient.
But regression is inherently a model about the outcome variable. What predicts its value and how well? The nature of how predictors relate to it (linearly, quadratically, multiplicatively?). How much of its variance can be explained by the predictors, and how much is just random? The focus is all on that outcome.
In fact, in a regression model, only the outcome variable is considered a random variable. This means that while we can explain or predict some of its variation, we can’t explain all of it. It is subject to some sort of random process that affects its value in any particular case.
Not so for the predictor variables. The predictor variables are assumed to have NO random process. And therefore, there are no assumptions about the distribution of predictor variables. None.
They don’t have to be normally distributed, continuous, or even symmetric.
But you do have to be able to interpret their coefficients. The basic interpretation of a regression coefficient is that it is the size of the average difference in Y (the outcome variable) for each one-unit difference in X (the relevant independent variable) after controlling for the effects of all other Xs in the model.
If you examine that statement carefully, you’ll notice a few things.
1. You need to have a one-unit difference in X. If X is numeric and continuous, a one-unit difference in X easily makes sense.
If X is numeric and discrete (like number of children or violent episodes or job losses), a one-unit difference still makes sense.
If X is nominal categorical, a one-unit difference doesn’t make much sense on its own. What is a one-unit difference for a nominal variable like Gender? Well, if you code the two categories of Gender to be one unit apart from each other, as is done in dummy coding, or one unit apart from the grand mean, as is done in effect coding, you can force the coefficient to make sense.
But what if X is ordinal, with ordered categories? There is no clever coding scheme that can preserve the order without treating all the one-unit differences as equivalent. So while there is no assumption that X not be ordinal, there is no way to interpret its coefficients in a meaningful way as ordinal. So you are left with two options: lose the order and treat it as nominal, or assume that the one-unit differences are equivalent and treat it as numeric.
Neither option is ideal. The first throws away good information, and the second assumes information that doesn’t exist. Which is better depends on how realistic is the assumption of equal unit differences and how strong the effect of ordering is.
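To make the two options concrete, here is a minimal sketch using NumPy. The data are invented and noise-free: a three-level ordinal predictor whose true steps in Y are unequal (2, then 8). The helper name `ols` is just a label chosen for this example, not a library function.

```python
import numpy as np

# Hypothetical ordinal predictor with levels 1 < 2 < 3 (two cases per level).
levels = np.array([1, 1, 2, 2, 3, 3])
# Made-up outcome: group means are 10, 12, and 20, so the steps are unequal.
y = np.array([10.0, 10.0, 12.0, 12.0, 20.0, 20.0])

def ols(X, y):
    """Least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

# Option 1: treat as nominal, using dummy coding with level 1 as the reference.
dummies = np.column_stack([levels == 2, levels == 3]).astype(float)
nominal_coef = ols(dummies, y)   # [reference mean, level 2 vs 1, level 3 vs 1]

# Option 2: treat as numeric, which forces every one-unit step to be equal.
numeric_coef = ols(levels.astype(float), y)   # [intercept, single slope]

print("nominal coding:", nominal_coef)
print("numeric coding:", numeric_coef)
```

In this toy case the dummy-coded model recovers each level's difference from the reference but carries no notion of order, while the numeric model reports a single average step that matches neither of the actual steps.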
2. While the structure of Y is different for different types of regression models (linear, logistic, Cox, etc.), as long as you take that structure into account, the interpretation of coefficients is the same. In other words, although you have to take the structure of Y into account, a dummy variable or a quadratic term works the same way in any regression model.
3. The unit in which X is measured matters. It may be useful to conduct a linear transformation on X to change its scaling. For example, if X is annual salary measured in dollars, a one-dollar change is minuscule, and not very meaningful. Dividing all values of X by 1000 to change the units would make the coefficient easier to interpret.
4. The other terms in the model matter. Some coefficients are interpretable only when the model contains other terms. For example, interactions aren’t interpretable without the terms that make them up (lower-order terms). And including an interaction changes the meaning of those lower-order terms from main effects to marginal effects.
Likewise, the units of the other terms in the model can affect the coefficients themselves. The reference category chosen for a dummy variable, or vast scale differences in the units of measurement of different variables (e.g., one variable is on a 0 to 1 scale and another is on a 1 to 100,000 scale), can affect coefficients. In the latter example, a predictor with a much larger scale can dominate the regression model, just by how it is measured. Likewise, if a predictor is highly skewed, its extreme values can have undue influence on the regression coefficients.
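Here is a hedged sketch of how an interaction ties the meaning of a lower-order coefficient to the scaling of the other predictor. It uses NumPy with simulated, noise-free data; the generating equation, the variable names, and the `ols` helper are all assumptions made for this example. With an x1*x2 term in the model, the coefficient on x1 is the slope of x1 where x2 equals zero, so moving the zero point of x2 (here by centering it) changes that coefficient even though the model fits identically.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 5, size=50)
x2 = rng.uniform(5, 15, size=50)
# Assumed generating equation for illustration (no noise term).
y = 1 + 2 * x1 + 3 * x2 + 0.5 * x1 * x2

def ols(y, *columns):
    """Least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

# Raw scaling: the x1 coefficient is the x1 slope at x2 = 0.
raw = ols(y, x1, x2, x1 * x2)

# Centered x2: the x1 coefficient is now the x1 slope at the mean of x2.
x2_centered = x2 - x2.mean()
centered = ols(y, x1, x2_centered, x1 * x2_centered)

print("x1 coefficient, raw x2:     ", raw[1])
print("x1 coefficient, centered x2:", centered[1])
print("interaction coefficient (both):", raw[3], centered[3])
```

The interaction coefficient is the same in both fits; only the lower-order coefficient on x1 moves, because it answers a different question in each parameterization.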
So even though there are no theoretical assumptions about the distribution of predictor variables, paying attention to their scales of measurement, their distributions, and how they fit into the overall model makes good practical sense.
If you want to learn in detail what the assumptions really mean, how to check them, and what to do if they’re not met, check out our Assumptions of Linear Models Workshop.