The Distribution of Independent Variables in Regression Models

While regression models come with a number of distributional assumptions, there is one distribution with no assumptions at all: that of the predictor (i.e., independent) variables.

This is because regression models are directional. A correlation has no direction: Y and X are interchangeable, and if you switched them, you’d get the same correlation coefficient.

But regression is inherently a model about the outcome variable: what predicts its value, and how well? How do the predictors relate to it (linearly, quadratically, multiplicatively)? How much of its variance can the predictors explain, and how much is just random? The focus is all on that outcome.

In fact, in a regression model, only the outcome variable is considered a random variable. This means that while we can explain or predict some of its variation, we can’t explain all of it. It is subject to some sort of random process that affects its value in any particular case.

Not so for the predictor variables. The predictors are assumed to have NO random process of their own. And therefore, there are no assumptions about the distribution of predictor variables. None.

They don’t have to be normally distributed, continuous, or even symmetric.

But you do have to be able to interpret their coefficients. The basic interpretation of a regression coefficient is the size of the average difference in Y (the outcome variable) for each one-unit difference in X (the relevant independent variable), after controlling for the effects of all other Xs in the model.

If you examine that statement carefully, you’ll notice a few things.

1. You need to have a one-unit difference in X. If X is numeric and continuous, a one-unit difference in X easily makes sense.

If X is numeric and discrete (like number of children or violent episodes or job losses), a one-unit difference still makes sense.

If X is nominal categorical, a one-unit difference doesn’t make much sense on its own. What is a one-unit difference for a nominal variable like Gender? Well, if you code the two categories of Gender to be one unit apart from each other, as is done in dummy coding, or one unit apart from the grand mean, as is done in effect coding, you can force the coefficient to make sense.
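As a rough illustration (using Python with simulated data, both of which are my own choices rather than anything from the original article), here is how dummy coding and effect coding put the categories one unit apart:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: outcome y and a two-category predictor like Gender
rng = np.random.default_rng(0)
gender = rng.integers(0, 2, size=100)        # dummy coding: 0 = reference, 1 = other
y = 5 + 2 * gender + rng.normal(size=100)    # true group difference is 2

# Dummy coding: the two categories are one unit apart (0 vs. 1), so the
# slope is the average difference in y between the groups.
X_dummy = sm.add_constant(gender.astype(float))
print(sm.OLS(y, X_dummy).fit().params)

# Effect coding: categories coded -1 and +1 are each one unit from the
# grand mean, so the slope is the distance from a group mean to the
# grand mean (half the dummy-coded slope).
effect = np.where(gender == 1, 1.0, -1.0)
X_effect = sm.add_constant(effect)
print(sm.OLS(y, X_effect).fit().params)
```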

But what if X is ordinal (ordered categories)? There is no clever coding scheme that preserves the order without treating all the one-unit differences as equivalent. So while there is no assumption that X can’t be ordinal, there is also no way to interpret its coefficient meaningfully. You are left with two options: lose the order and treat it as nominal, or assume the one-unit differences are equivalent and treat it as numeric.

Neither option is ideal. The first throws away good information, and the second assumes information that doesn’t exist. Which is better depends on how realistic the assumption of equal unit differences is and how strong the effect of the ordering is.
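To make the two options concrete, here is a minimal sketch, again with simulated data and a hypothetical ordinal predictor educ (my own example, not the article’s):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical ordinal predictor: education level 1 < 2 < 3
rng = np.random.default_rng(1)
df = pd.DataFrame({"educ": rng.integers(1, 4, size=200)})
df["y"] = 3 + 1.5 * df["educ"] + rng.normal(size=200)

# Option 1: treat as nominal. Dummies discard the order, but each
# category gets its own coefficient (difference from the reference).
nominal = smf.ols("y ~ C(educ)", data=df).fit()
print(nominal.params)

# Option 2: treat as numeric. This assumes every one-unit step in
# educ has the same average effect on y.
numeric = smf.ols("y ~ educ", data=df).fit()
print(numeric.params)
```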

2. While the structure of Y is different for different types of regression models (linear, logistic, Cox, etc.), as long as you take that structure into account, the interpretation of coefficients is the same. In other words, although you have to take the structure of Y into account, a dummy variable or a quadratic term works the same way in any regression model.
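A minimal sketch of this point (simulated data; statsmodels is an assumption, not the article’s tool), showing the same dummy variable in a linear and a logistic model:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.integers(0, 2, size=500).astype(float)   # a dummy-coded predictor
X = sm.add_constant(x)

# Linear regression: the coefficient is the average difference in a
# continuous outcome between the two groups.
y_cont = 1.0 + 0.8 * x + rng.normal(size=500)
print(sm.OLS(y_cont, X).fit().params)

# Logistic regression: the same one-unit difference in x, but the
# coefficient is now a difference in the log-odds of a binary outcome.
p = 1 / (1 + np.exp(-(-0.5 + 0.8 * x)))
y_bin = rng.binomial(1, p)
print(sm.Logit(y_bin, X).fit(disp=0).params)
```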

3. The unit in which X is measured matters. It may be useful to conduct a linear transformation on X to change its scaling. For example, if X is annual salary measured in dollars, a one-dollar difference is minuscule and not very meaningful. Dividing all values of X by 1000 to change the units makes the coefficient easier to interpret.
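A quick sketch with simulated salary data (hypothetical numbers) shows that rescaling changes nothing but the units: the coefficient on salary in thousands is exactly 1000 times the coefficient on salary in dollars.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
salary = rng.uniform(20_000, 120_000, size=300)   # annual salary in dollars
y = 10 + 0.0002 * salary + rng.normal(size=300)

# In dollars: a one-dollar difference gives a tiny coefficient.
print(sm.OLS(y, sm.add_constant(salary)).fit().params)

# In thousands of dollars: same model, same fit. The coefficient is
# simply 1000 times larger and easier to talk about.
print(sm.OLS(y, sm.add_constant(salary / 1000)).fit().params)
```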

4. The other terms in the model matter. Some coefficients are interpretable only when the model contains other terms. For example, interactions aren’t interpretable without the terms that make them up (the lower-order terms). And including an interaction changes the meaning of those lower-order terms from main effects to marginal effects, as the sketch below illustrates.

Likewise, the units of the other terms in the model can affect the coefficients themselves. The reference category chosen for a dummy variable, or vast scale differences in the units of measurement of different variables (e.g., one variable is on a 0 to 1 scale and another is on a 1 to 100,000 scale), can affect coefficients. In the latter example, a predictor with a much larger scale can dominate the regression model just by how it is measured. Likewise, if a predictor is highly skewed, its extreme values can have undue influence on the regression coefficients.
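Here is the interaction sketch promised above (simulated data, my own example): the coefficient on x1 changes meaning once the x1:x2 interaction enters the model.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "x1": rng.normal(size=400),
    "x2": rng.integers(0, 2, size=400).astype(float),
})
df["y"] = 1 + 0.5 * df.x1 + 1.0 * df.x2 + 0.7 * df.x1 * df.x2 + rng.normal(size=400)

# Without the interaction, the x1 coefficient is an overall (main) effect.
print(smf.ols("y ~ x1 + x2", data=df).fit().params)

# With the interaction, the x1 coefficient is the effect of x1 when
# x2 = 0: a conditional (marginal) effect, not a main effect.
print(smf.ols("y ~ x1 * x2", data=df).fit().params)
```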

So even though there are no theoretical assumptions about the distribution of predictor variables, paying attention to their scales of measurement, their distributions, and how they fit into the overall model makes good practical sense.


Comments

  1. Iuval Clejan says

    If there are known variances in the dependent variable, their inverses can be used as weights in computing the chi-square sum in the regression analysis. But what if we also have information about the independent variables’ variances? Can we use that information somehow? Naively, I would say (for one independent variable x_i and one dependent variable y_i) to find the regression coefficient r that minimizes chisquare = Sum_i[(r*x_i - y_i)^2 / (r^2*sigmax_i^2 + sigmay_i^2)].
    Is this correct?
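For readers who want to experiment, a minimal numerical sketch of the objective described in this comment (using simulated data and scipy, both of which are assumptions, not from the original thread; the approach is related in spirit to errors-in-variables regression, and this sketch is not a verdict on whether it is correct):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Simulated data with known measurement variances on both x and y
rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, size=50)
y = 2.0 * x + rng.normal(0.0, 0.3, size=50)
sigmax2 = np.full(50, 0.1 ** 2)   # assumed known variances of the x_i
sigmay2 = np.full(50, 0.3 ** 2)   # assumed known variances of the y_i

def chisquare(r):
    # Sum_i (r*x_i - y_i)^2 / (r^2*sigmax_i^2 + sigmay_i^2)
    return np.sum((r * x - y) ** 2 / (r ** 2 * sigmax2 + sigmay2))

print(minimize_scalar(chisquare).x)   # estimated slope r
```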

  2. Kazeem says

    hello everyone
    I’m working on the determinants of food security, where my DV is a set of food security categories. Food security is ordinal in nature, according to severity, but I applied a multinomial regression analysis, which treats the categories as nominal. My assumption was that the IVs affect the food security categories differently, and my results are more robust when I apply multinomial logistic than ordered logit. Have I violated any assumption? And if yes, what are the implications? Thank you.

    • Karen Grace-Martin says

      Hi Kazeem,
      I don’t give statistical advice here, because I generally have to ask so many questions to understand the full research context. But I will say that, generally, multinomial logistic regression has fewer assumptions than ordered logistic regression.

  3. Ela says

    Hey,
    I’d like to quote your article in my study, but I don’t know how. Could you tell me the year you published your article on this website?

  4. Zahid says

    “I checked my books and your statement is implied by stating that the residuals have to be normally distributed.” Yes, the residuals must be, but what about the independent variables? Do they have to follow the normality assumption or not?

    • HIba AH says

      No, the only distributional assumption for regression models is the normality and homogeneity of variance of the errors (residuals), not of the independent variable(s) or the dependent variable.

  5. Alexy says

    Hello,

    I agree with you that “there are no assumptions about the distribution of predictor variables.” However, a few statisticians at work think otherwise. Some modelers agree with me (and you). Do you have a book reference on that? I checked my books, and your statement is implied by their saying that the residuals have to be normally distributed. However, there is no mention of the independent variables.

    Thanks,

    Alexy

