I often hear concern about the non-normal distributions of independent variables in regression models, and I am here to ease your mind.
There are NO assumptions in any linear model about the distribution of the independent variables. Yes, you only get meaningful parameter estimates from nominal (unordered categories) or numerical (continuous or discrete) independent variables. But no, the model makes no assumptions about them. They do not need to be normally distributed or continuous.
It is useful, however, to understand the distribution of predictor variables to find influential outliers or concentrated values. A highly skewed independent variable may be made more symmetric with a transformation.
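As a quick illustration of that last point, here is a Python sketch on simulated, made-up data (a lognormal predictor, invented for illustration): a log transformation can make a right-skewed predictor far more symmetric.

```python
import numpy as np

rng = np.random.default_rng(42)

# A hypothetical right-skewed predictor (lognormal, e.g. income-like data).
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

def skewness(v):
    """Sample skewness: the third standardized moment."""
    z = (v - v.mean()) / v.std()
    return (z ** 3).mean()

skew_raw = skewness(x)          # strongly positive for lognormal data
skew_log = skewness(np.log(x))  # near zero: the log makes it symmetric
print(f"raw skew = {skew_raw:.2f}, log skew = {skew_log:.2f}")
```

Nothing about the model requires this, but a more symmetric predictor makes it easier to spot influential points and concentrated values in diagnostic plots.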
Ariel Balter says
@Stefan Ehrlich this is a classic case of zero-inflated data. You will only have non-zero data for number of cigarettes smoked per day for smokers. So you have something like P[smoke N cigs / day] = I*D[smoke N cigs / day] where D is the distribution AMONG SMOKERS and I is an indicator variable for whether or not the subject is a smoker. There is tons of information out there about zero-inflated data and appropriate analysis methods.
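The I*D structure Ariel describes is easy to simulate. A minimal Python sketch, where the 30% smoking rate and the Poisson mean of 15 cigarettes/day are invented numbers purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# I: indicator of being a smoker (an assumed 30% rate).
is_smoker = rng.random(n) < 0.30

# D: cigarettes/day among smokers (a Poisson is one common choice).
cigs_if_smoker = rng.poisson(lam=15, size=n)

# Observed variable: exactly zero for non-smokers, D otherwise.
cigs = np.where(is_smoker, cigs_if_smoker, 0)

share_zero = (cigs == 0).mean()
print(f"fraction of zeros: {share_zero:.2f}")  # roughly the non-smoker rate
```

The spike of zeros comes from the indicator I, not from the count distribution D, which is exactly what zero-inflated models are built to separate.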
Ariel Balter says
Does it make any sense at all to speak of the distribution of a predictor? In the context of regression, independent variables are not considered random. For instance, suppose you have data for blood pressure vs. age, and the ages you happened to survey have some distribution. Does it make sense to think of age as a random variable with a distribution?
Karen Grace-Martin says
There are situations where it makes sense to think of a predictor as random. This is called Type II regression or Major Axis regression.
Chao Yue says
I think Ariel Balter’s point is that talking about the *statistical* distribution of the independent variables is actually inappropriate, because in the framework of linear regression (if we limit ourselves to OLS) the independent variables are not considered random but rather fixed values, while the linear model’s job is to predict the conditional distribution of the response variable given a combination of independent variable values. I did not realize this until I had this question. I hope I made my point clear, because English is not my native language …
Karen Grace-Martin says
Yes, very clear. I agree with Ariel, and that was my point in the original article. Xs are not random variables in regression models, so we don’t consider their statistical distribution. But in Type II regression (also called Major Axis regression) both X and Y are assumed to be random variables, so it makes sense to look at X’s distribution.
Hi Karen, I need a book that explains that the independent variable does not need to be normally distributed in regression analysis. Can you give me the title of such a book?
I don’t know that you’ll ever find that statement in a book. It’s one of those things where it’s just absent. All books about regression state the assumption of normality as Y|X or the errors. But as for X’s distribution, you might find something that says that X is not a random variable and is fixed. But that’s it.
If Tyan is really looking for an explicit mention of the fact that no assumptions are made about the distribution of independent variables other than independence between the independent variables and the error, then they can find one in Fox, J (2016). Applied Regression Analysis & Generalized Linear Models (3rd edition) on page 318:
“(…) the general linear model, which makes no distributional assumptions about the Xs, other than independence between the Xs and the errors.”
This is mentioned in the context of discrete explanatory variables.
Karen, I’m really glad to hear that there is no assumption for independent data to be normally distributed; however, mine are seriously skewed: they’re ratios, with many values close to 1 and just a few above 10. The residual plots against the dependents look pretty clumped too. Logging or square rooting doesn’t help. However, my GLM seems happy and the results make sense – does this mean I can trust them?
Thanks for this helpful post. I came across a text (Applied Predictive Analytics by Dean Abbott) stating that it is useful to correct any skewness in the predictor variables, and this article helped remind me of the assumptions of a linear model. The reasoning cited in the text is that the tails of predictor variables have a disproportionate impact on the slope of the line. I don’t think that is right: if the model is truly linear, then outlying predictor values actually help reduce uncertainty in the regression line, as the formula for the variance of the slope coefficient shows – it is inversely proportional to the squared distance of the predictor values from their mean.
That said, what do you think of the following two lines of reasoning for justifying transforming predictors to correct for skew:
1) If a linear model is applicable, then a skewed predictor will result in a skewed distribution for the response. When we have many predictors, some of which are skewed and some not, then it makes sense to transform both the skewed predictors, and if skewed, the response, so that the true mean function of the (possibly transformed) response is more linear with respect to the (transformed) predictors.
2) If the true mean function of the response is non-linear in a predictor, then applying a variance-correcting transformation, such as the log, to that predictor makes the spacing between predictor values more even. This stretches out the true mean function of the response where points are dense, reducing its curvature there, and pulls outliers in, reducing their influence on the slope. This improves the fit of the linear model.
1) A skewed predictor will not necessarily result in a skewed distribution for the response. But it is pretty common that a skewed predictor *doesn’t* have a linear relationship with the response. So sometimes doing a log transformation on X solves multiple problems simultaneously. Graphing is your friend.
2) Yes, exactly!
I am running an OLS regression with highly skewed IVs; the residuals, however, are normal. The IV is on a dichotomous scale (0 and 1). I know OLS doesn’t require normally distributed IVs. I just wanted to know how kurtosis (leptokurtic) and skewness are accounted for in an OLS panel-data setting. I need basic info; I am quite ignorant when it comes to econometrics.
I thought this post was very helpful. I was hoping you can help me with this though; I am looking at a data set and my main independent variable of interest is dichotomous variable and I would like to run a regression analysis on this data. However, I noticed about 80% of the data is in one category and 20% is in the other. Can I still run a regression analysis on this data?
It’s fine, although the power will be limited by the smaller sample size.
I’ve got a different issue. I’m trying to run a regression analysis with EVAtm as one of the independent variables. However, some of the observations are (typically) negative numbers. How do I deal with that? Is there a need to adjust, and if so, how do I do it?
I need a good reference for my thesis, such as (Sweet, S.A., & Grace-Martin, K. (2011). Data Analysis with SPSS: A First Course in Applied Statistics, Plus MySearchLab with eText – Access Card Package. Pearson College Division), but I cannot get this book. So please send me some sections of the book that explain that we can use linear regression models with non-normal distributions of the independent or dependent variables.
Thanks a lot
You’d have to contact the publisher, Pearson–I don’t actually know anything about the E version.
And actually, the only regression models it includes are linear models and logistic regression for a binary response.
I am trying to find a regression model that takes into account the distribution of the independent variables.
My reasoning is that there is not only a distribution on the y-axis, but also on the x’s (uncertainty in the measurement of the xs). Hence the classic regression model doesn’t account for that uncertainty in the xs. Do you know of a model that does?
Ultimately, I would love to be able to calculate the effect of the uncertainty in xs on y.
Yes, there is. It’s called Type II or Major Axis Regression. I helped a client with it years ago for the same reason, but haven’t used it since, so I can’t recommend a resource.
But if you google it, you’ll find plenty of explanations.
many thanks for your reply. It helped me a great deal.
However, I am now trying to figure out how to run the Major Axis Regressions in SAS. Do you have any idea what command to use; I have been unsuccessful at finding it thus far…
Also, I can’t seem to find anything about type II regressions in conventional statistics books. I understand that you do not have a reference book in mind but do you know at least where I could start looking? All I seem to find on Google are articles using this methodology. I would love to have a proper reference for this methodology though…
Hmmm, I know I’ve seen books that include it. I just did a search on Amazon and quite a few books came up. I can’t recommend any b/c I haven’t read them, but for example there is a section in this book on it, according to the Table of Contents: Linear Models and Generalizations: Least Squares and Alternatives, by C. Radhakrishna Rao, Helge Toutenburg.
I would start with an Amazon search, or better yet, if you have a good university library, search there. Good Luck!
I’m probably responding to an old question (note to webmaster: please turn on dates on posts!). The regression Maude may be looking for is a Deming regression and it’s available in R.
Jan 27, 2018
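For anyone who lands here without R: the major axis (Type II) slope can be computed directly from the sample variances and covariance. A minimal Python sketch on simulated data (the true line y = 2x and the noise levels are made up for illustration):

```python
import numpy as np

def major_axis_slope(x, y):
    """Major axis (Type II) regression slope.

    Treats both x and y as random: it minimizes perpendicular
    distances to the line rather than vertical ones.
    """
    sxx = np.var(x)
    syy = np.var(y)
    sxy = np.cov(x, y, bias=True)[0, 1]
    d = syy - sxx
    return (d + np.sqrt(d * d + 4 * sxy * sxy)) / (2 * sxy)

# Hypothetical data: true line y = 2x, with noise in BOTH variables.
rng = np.random.default_rng(1)
x_true = rng.uniform(0, 10, size=5_000)
x = x_true + rng.normal(0, 0.5, size=5_000)
y = 2 * x_true + rng.normal(0, 0.5, size=5_000)

slope = major_axis_slope(x, y)
intercept = y.mean() - slope * x.mean()
print(f"slope ~ {slope:.3f}, intercept ~ {intercept:.3f}")
```

Note that this version assumes equal error variances in x and y; Deming regression generalizes it by letting you specify the ratio of the two error variances.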
Joanne Lello says
I wonder if the original poster – Karen – could comment on why, if there is no assumption of normality for the independent variable, I get differences in significance for some of my independent variables when I transform them, compared to when they are left highly skewed?
An assumption of normality just means that the p-value you’re getting is calculated based on a normal distribution. So if the data aren’t normal, the p-value you get isn’t right. You could for example, put in the same Xs and Ys and assume a Poisson distribution, and the p-value will differ. Because they’re based on different assumptions.
What you’re doing by transforming X (the independent variable) is really fitting the model with a different independent variable: X is scaled differently. Another assumption is that you have the right independent variables in the model. But you’re still going to base the p-value on the other assumption, of a normal distribution for the residuals.
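To see why a transformed X is really a different predictor, here is a Python sketch on simulated data (the log-linear relationship is invented for illustration): the same Y regressed on x versus log(x) gives visibly different fits, so different test statistics follow naturally.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000

# Hypothetical skewed predictor; the true relationship is with log(x).
x = rng.lognormal(0, 1, size=n)
y = 3 * np.log(x) + rng.normal(0, 1, size=n)

def ols_pred(col, y):
    """Fit intercept + slope by least squares; return fitted values."""
    X = np.column_stack([np.ones(len(col)), col])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

def r_squared(pred, y):
    return 1 - (y - pred).var() / y.var()

r2_raw = r_squared(ols_pred(x, y), y)          # y regressed on x
r2_log = r_squared(ols_pred(np.log(x), y), y)  # y regressed on log(x)
print(f"R^2 with x: {r2_raw:.2f}, with log(x): {r2_log:.2f}")
```

Same Y, same information, but two different models – so there is no reason their coefficients or p-values should agree.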
I think his model is fine, because he said the count variable is an IV, not the DV.
Stefan Ehrlich says
Thank you for this very helpful information. However, I have a highly right-skewed distribution for one of my independent variables (# of cigarettes smoked per day; most subjects = 0). This seems to influence the distribution of the residuals of my multiple regression model – they are non-normal as well. Is my model still valid?
Stefan, great question. No, your model isn’t valid as is. Most count variables, like yours, with most values = 0 follow a Poisson distribution, or something in that family. If you fit an ordinary multiple regression model, you are both violating its assumptions and allowing negative predicted values, which clearly aren’t accurate.
You can learn more at:
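To illustrate the point about negative predicted values, here is a Python sketch on simulated count data (the coefficients and the exponential mean function are made up for illustration): an ordinary least-squares line fit to Poisson counts dips below zero, even though a count can never be negative.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2_000

# Hypothetical count outcome whose mean grows exponentially with x.
x = rng.uniform(-2, 2, size=n)
y = rng.poisson(np.exp(0.2 + x))  # counts: many zeros at low x

# Naive OLS fit (intercept + slope via least squares) on the raw counts.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
ols_pred = X @ beta

# The straight line goes below zero at low x -- impossible for a count.
print(f"minimum OLS prediction: {ols_pred.min():.2f}")
```

A Poisson (or negative binomial) regression models log of the mean instead, so its predicted counts stay positive by construction.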