The Distribution of Independent Variables in Regression Models

by Karen Grace-Martin 27 Comments

I often hear concern about the non-normal distributions of independent variables in regression models, and I am here to ease your mind.

There are NO assumptions in any linear model about the distribution of the independent variables.  Yes, you only get meaningful parameter estimates from nominal (unordered categories) or numerical (continuous or discrete) independent variables.  But no, the model makes no assumptions about them.  They do not need to be normally distributed or continuous.

It is useful, however, to understand the distribution of predictor variables to find influential outliers or concentrated values.  A highly skewed independent variable may be made more symmetric with a transformation.
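
To make this concrete, here is a minimal simulation sketch (in Python with numpy and statsmodels, which are assumptions of this illustration rather than tools used in the post): the predictor is strongly right-skewed, yet ordinary least squares recovers the true coefficients, because the normality assumption concerns the errors, not X.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500

x = rng.lognormal(mean=0, sigma=1, size=n)       # strongly right-skewed predictor
y = 2 + 0.5 * x + rng.normal(scale=1.0, size=n)  # errors are normal; x is not

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(fit.params)        # close to the true intercept (2) and slope (0.5)
print(fit.resid.mean())  # the model's assumptions concern these residuals, not x
```

The useful check, in other words, is a residual plot, not a histogram of the predictor.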


Tagged With: checking assumptions, distribution, independent variable, normality, predictor variable, regression models

Comments

  1. Ariel Balter says

    September 2, 2019 at 9:21 pm

    @Stefan Ehrlich this is a classic case of zero-inflated data. You will only have non-zero data for number of cigarettes smoked per day for smokers. So you have something like P[smoke N cigs / day] = I*D[smoke N cigs / day] where D is the distribution AMONG SMOKERS and I is an indicator variable for whether or not the subject is a smoker. There is tons of information out there about zero-inflated data and appropriate analysis methods.

    Reply
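
As a minimal sketch of the zero-inflated structure described in the comment above (Python with numpy; the Bernoulli and Poisson parameters are chosen purely for illustration), the observed count is an indicator for being a smoker times a draw from the distribution among smokers:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

smoker = rng.binomial(1, 0.25, size=n)        # I: indicator of being a smoker
cigs_among_smokers = rng.poisson(12, size=n)  # D: cigarettes/day among smokers
cigs_per_day = smoker * cigs_among_smokers    # observed variable: spike at zero plus a count tail

print((cigs_per_day == 0).mean())  # roughly 75% zeros from non-smokers alone
```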
  2. Ariel Balter says

    September 2, 2019 at 9:18 pm

    Does it make any sense at all to speak of the distribution of a predictor? In the context of regression, independent variables are not considered random. For instance, suppose you have data for blood pressure vs. age, and the ages surveyed happen to follow some distribution. Does it make sense to think of age as a random variable with a distribution?

    Reply
    • Karen Grace-Martin says

      September 3, 2019 at 10:03 am

      Hi Ariel,

      There are situations where it makes sense to think of a predictor as random. This is called Type II regression or Major Axis regression.

      Reply
      • Chao Yue says

        May 20, 2020 at 10:21 am

        I think Ariel Balter’s point is that talking about the *statistical* distribution of the independent variables is actually inappropriate, because in the framework of linear regression (if we limit ourselves to the OLS method) the independent variables are not considered random but rather fixed values, while the linear model’s job is to predict the conditional distribution of the response variable given a combination of independent variable values. I did not realize this until I had this question. I hope I made my point clear, because English is not my native language …

        Reply
        • Karen Grace-Martin says

          June 2, 2020 at 4:04 pm

          Hi Chao,

          Yes, very clear. I agree with Ariel, and that was my point in the original article. Xs are not random variables in regression models, so we don’t consider their statistical distribution. But in Type II regression (also called Major Axis Regression) both X and Y are assumed to be random variables, so it makes sense to look at X’s distribution.

          Reply
  3. Tyan says

    September 19, 2017 at 11:43 am

    Hi Karen, I need a book that explains that the independent variable does not need to be normally distributed in regression analysis. Can you give me the title of the book?
    Thanks

    Reply
    • Karen says

      September 21, 2017 at 4:40 pm

      Hi Tyan,

      I don’t know that you’ll ever find that statement in a book. It’s one of those things where it’s just absent. All books about regression state the assumption of normality as Y|X or the errors. But as for X’s distribution, you might find something that says that X is not a random variable and is fixed. But that’s it.

      Reply
      • Sebastiaan says

        September 29, 2018 at 10:13 am

        If Tyan is really looking for an explicit mention of the fact that no assumptions are made about the distribution of independent variables other than independence between the independent variables and the error, then they can find one in Fox, J (2016). Applied Regression Analysis & Generalized Linear Models (3rd edition) on page 318:

        “(…) the general linear model, which makes no distributional assumptions about the Xs, other than independence between the Xs and the errors.”

        This is mentioned in the context of discrete explanatory variables.

        Reply
  4. Andy says

    August 6, 2016 at 3:02 am

    Karen, I’m really glad to hear that there is no assumption that independent variables be normally distributed; however, mine are seriously skewed: they’re ratios, with many values close to 1 and just a few above 10. The residual plots against the dependents look pretty clumped too. Logging or square-rooting doesn’t help. However, my GLM seems happy and the results make sense – does this mean I can trust them?

    Reply
  5. Mike says

    September 17, 2015 at 11:50 am

    Thanks for this helpful post. I came across a text (Applied Predictive Analytics by Dean Abbott) stating that it is useful to correct any skewness in the predictor variables, and this article helped remind me of the assumptions of a linear model. The reasoning cited in the text is that the tails of predictor variables have a disproportionate impact on the slope of the line. I don’t think that reasoning is appropriate: if the model is truly linear, then outlying predictor values are very helpful in reducing uncertainty in the regression line, as shown by the formula for the variance of the slope coefficient, which is inversely proportional to the sum of squared distances of the predictor values from their mean.

    That said, what do you think of the following two lines of reasoning for justifying transforming predictors to correct for skew:
    1) If a linear model is applicable, then a skewed predictor will result in a skewed distribution for the response. When we have many predictors, some of which are skewed and some not, then it makes sense to transform both the skewed predictors, and if skewed, the response, so that the true mean function of the (possibly transformed) response is more linear with respect to the (transformed) predictors.
    2) If the true mean function of the response is non-linear in a predictor, then applying a variance-correcting transformation, such as the log function, to that predictor makes the distances between values of the predictor more even. This stretches out the true mean function of the response where points are dense, reducing its curvature there, and pulls in outliers, reducing their influence on the slope. This improves the fit of the linear model.

    Reply
    • Karen says

      January 30, 2018 at 10:30 am

      Hi Mike,

      1) A skewed predictor will not necessarily result in a skewed distribution for the response. But it is pretty common that a skewed predictor *doesn’t* have a linear relationship with the response. So sometimes doing a log transformation on X solves multiple problems simultaneously. Graphing is your friend.
      2) Yes, exactly!

      Reply
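
For reference, the slope-variance result Mike appeals to above is the standard simple-regression formula (not written out in the thread):

```latex
\operatorname{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
```

Predictor values far from the mean enlarge the denominator and shrink the variance of the estimated slope, which is why extreme x-values are informative when the linear model really is correct.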
  6. Mamoona says

    February 14, 2015 at 10:11 am

    Hi Karen,

    I am running an OLS regression with highly skewed IVs; the residuals, however, are normal. The IV is based on a dichotomous scale (0 and 1). I know OLS doesn’t require normally distributed IVs. I just wanted to know how kurtosis (leptokurtic) and skewness are accounted for in an OLS panel data setting. I need basic info; I am quite ignorant when it comes to econometrics.

    Thanks

    Reply
  7. Mary says

    September 28, 2014 at 4:40 pm

    Hi,

    I thought this post was very helpful. I was hoping you could help me with this, though: I am looking at a data set and my main independent variable of interest is a dichotomous variable, and I would like to run a regression analysis on this data. However, I noticed about 80% of the data is in one category and 20% is in the other. Can I still run a regression analysis on this data?

    Thank you

    Reply
    • Karen says

      October 20, 2014 at 9:32 am

      Hi Mary,

      It’s fine, although the power will be limited by the smaller sample size.

      Reply
  8. Jacques says

    July 23, 2014 at 9:28 am

    Hi
    Got a different issue: I’m trying to run a regression analysis with EVA™ as one of the independent variables. However, some of the observations are (typically) negative numbers. How do I deal with that? Is there a need to adjust, and if so, how do I do it?

    Reply
  9. zahra says

    August 28, 2013 at 3:08 pm

    Hi,
    I need a good reference, like Sweet, S.A., & Grace-Martin, K. (2011). Data Analysis with SPSS: A First Course in Applied Statistics Plus Mysearchlab with Etext — Access Card Package (Pearson College Division), for my thesis, but I cannot get this book, so please send me some sections of the book that tell us we can use linear regression models for non-normal distributions of independent or dependent variables.
    Thanks a lot

    Reply
    • Karen says

      September 25, 2013 at 10:58 am

      Hi Zahra,

      You’d have to contact the publisher, Pearson–I don’t actually know anything about the E version.

      And actually, the only regression models it includes are linear models and logistic regression for binary responses.

      Reply
  10. Maude says

    June 28, 2012 at 10:53 pm

    Hi Karen,
    I am trying to find a regression model that takes into account the distribution of the independent variables.
    My reasoning is that there is not only a distribution on the y-axis, but also on the x’s (uncertainty in the measurement of the xs). Hence the classic regression model doesn’t account for that uncertainty in the xs. Do you know of a model that does?
    Ultimately, I would love to be able to calculate the effect of the uncertainty in xs on y.
    Thanks.

    Reply
    • Karen says

      July 2, 2012 at 9:09 am

      Hi Maude,

      Yes, there is. It’s called Type II or Major Axis Regression. I helped a client with it years ago for the same reason, but haven’t used it since, so I can’t recommend a resource.

      But if you google it, you’ll find plenty of explanations.

      Karen

      Reply
      • Maude says

        August 14, 2012 at 8:16 pm

        Dear Karen,

        many thanks for your reply. It helped me a great deal.
        However, I am now trying to figure out how to run a Major Axis Regression in SAS. Do you have any idea what command to use? I have been unsuccessful at finding it thus far…

        Also, I can’t seem to find anything about type II regressions in conventional statistics books. I understand that you do not have a reference book in mind but do you know at least where I could start looking? All I seem to find on Google are articles using this methodology. I would love to have a proper reference for this methodology though…

        Thanks again!
        Best, Maude

        Reply
        • Karen says

          September 11, 2012 at 4:59 pm

          Hmmm, I know I’ve seen books that include it. I just did a search on Amazon and quite a few books came up. I can’t recommend any b/c I haven’t read them, but for example there is a section in this book on it, according to the Table of Contents: Linear Models and Generalizations: Least Squares and Alternatives, by C. Radhakrishna Rao, Helge Toutenburg.

          I would start with an Amazon search, or better yet, if you have a good university library, search there. Good Luck!

          Reply
    • Tony says

      January 27, 2018 at 7:04 pm

      I’m probably responding to an old question (note to webmaster: please turn on dates on posts!). The regression Maude may be looking for is a Deming regression and it’s available in R.

      Jan 27, 2018

      Reply
  11. Joanne Lello says

    February 22, 2012 at 1:44 pm

    I wonder if the original poster – Karen – could comment on why, if there is no assumption of normality for the independent variable, I get differences in significance for some of my independent variables when I transform them compared to when they are left highly skewed?

    Reply
    • Karen says

      February 24, 2012 at 8:00 pm

      Hi Joanne,

      An assumption of normality just means that the p-value you’re getting is calculated based on a normal distribution. So if the data aren’t normal, the p-value you get isn’t right. You could, for example, put in the same Xs and Ys and assume a Poisson distribution, and the p-value will differ, because the two p-values are based on different assumptions.

      What you’re doing by transforming X (the independent variable) is really calculating the model with a different independent variable. X is scaled differently. Another assumption is that you have the right independent variables in the model. But you’re still going to base the p-value off the other assumption of a normal distribution for the residuals.

      Karen

      Reply
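
A small sketch of the point in the reply above (Python with numpy and statsmodels, assumptions of this illustration): regressing Y on X and regressing Y on log(X) are different models, so their p-values can differ even though neither involves any assumption about X’s distribution.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200

x = rng.lognormal(sigma=1.2, size=n)            # highly skewed predictor
y = 1 + 0.8 * np.log(x) + rng.normal(size=n)    # simulated so Y is linear in log(x), not in x

fit_raw = sm.OLS(y, sm.add_constant(x)).fit()
fit_log = sm.OLS(y, sm.add_constant(np.log(x))).fit()

print(fit_raw.pvalues[1])  # slope p-value when x enters untransformed
print(fit_log.pvalues[1])  # slope p-value for log(x): a different model, a different answer
```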
  12. Joy says

    March 5, 2010 at 12:42 pm

    admin,
    I think his model is fine, because he said the count variable is an IV, not the DV.

    Reply
  13. Stefan Ehrlich says

    July 13, 2009 at 2:49 pm

    Thank you for this very helpful information. However, I have a highly right-skewed distribution of one of my independent variables (# of cigarettes smoked per day; most subjects = 0). This seems to influence the distribution of the residuals of my multiple regression model – they are non-normal as well. Is my model still valid?

    Reply
    • admin says

      July 13, 2009 at 9:55 pm

      Stefan, great question. No, your model isn’t valid as is. Most count variables, like yours, with most values = 0 follow a Poisson distribution, or something in that family. If you fit an ordinary multiple regression model, you are both violating its assumptions and allowing negative predicted values, which clearly aren’t accurate.

      You can learn more at:

      https://www.theanalysisfactor.com/?p=198
      https://www.theanalysisfactor.com/learning/past-teleseminars.html

      Reply
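
To illustrate the admin’s point about negative predicted values (which applies when the count variable is the outcome; as Joy notes above, that is not Stefan’s situation), here is a minimal sketch in Python with numpy and statsmodels, both assumptions of this example rather than tools mentioned in the thread:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 300

x = rng.normal(size=n)
counts = rng.poisson(np.exp(0.3 - 1.5 * x))     # count outcome with many zeros

X = sm.add_constant(x)
ols_fit = sm.OLS(counts, X).fit()
poisson_fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit()

print(ols_fit.predict(X).min())      # the straight line typically dips below zero here
print(poisson_fit.predict(X).min())  # Poisson predictions stay positive
```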
