The Analysis Factor

Statistical Consulting, Resources, and Statistics Workshops for Researchers

Why Logistic Regression for Binary Response?

by Karen Grace-Martin 22 Comments

Logistic regression models can seem pretty overwhelming to the uninitiated.  Why not use a regular regression model?  Just turn Y into an indicator variable–Y=1 for success and Y=0 for failure.

For some good reasons.

1. It doesn’t make sense to model Y as a linear function of the parameters because Y has only two values.  You just can’t make a line out of that (at least not one that fits the data well).

2. The predicted values from a linear model can be any positive or negative number, including values below 0 or above 1, not just 0 or 1.

3. The values of 0 and 1 are arbitrary.  The important part is not to predict the numerical value of Y, but the probability that success or failure occurs, and the extent to which that probability depends on the predictor variables.
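
To make the second point concrete, here is a minimal sketch that fits an ordinary straight-line regression to a 0/1 outcome. The data are made up purely for illustration, and the helper name `linear_prediction` is hypothetical; only the Python standard library is used.

```python
# Hypothetical data: predictor x and a binary outcome y (1 = success, 0 = failure).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form ordinary least squares for a single predictor.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_bar - slope * x_bar


def linear_prediction(value):
    """Fitted value from the straight-line model."""
    return intercept + slope * value


# Fitted values at the low and high ends of x fall outside the 0-to-1 range.
for value in (1, 5, 10, 12):
    print(f"x = {value:2d}  fitted = {linear_prediction(value):.3f}")
```

With these particular numbers the fitted value at x = 1 is negative and the fitted value at x = 10 is greater than 1, neither of which can be read as a probability.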

So okay, you say.  Why not use a simple transformation of Y, like the probability of success, that is, the probability that Y=1?

Well, that doesn’t work so well either.

Why not?

1. The right hand side of the equation can be any number, but the left hand side can only range from 0 to 1.

2. It turns out the relationship between that probability and the predictors is not linear, but rather follows an S-shaped (or sigmoidal) curve.

To obtain a linear relationship, we need to transform this response, Pr(success), as well.

As luck would have it, there are a few functions that:

1. are not restricted to values between 0 and 1

2. will form a linear relationship with our parameters

These functions include:

•Arcsine

•Probit

•Logit

All three of these can do the job, but (believe it or not) the logit function is the easiest to interpret, because its coefficients can be converted to odds ratios.
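
The three transformations listed above can be sketched in a few lines of Python. This is an illustration, not a definitive implementation: the function names are hypothetical, the arcsine transformation is assumed to be the common arcsine-of-the-square-root form, and the intercept (-3) and slope (0.5) of the S-shaped curve are made-up values.

```python
import math
from statistics import NormalDist


def logit(p):
    """Log-odds of p: ln(p / (1 - p))."""
    return math.log(p / (1 - p))


def probit(p):
    """Inverse of the standard normal cumulative distribution function."""
    return NormalDist().inv_cdf(p)


def arcsine_sqrt(p):
    """Arcsine of the square root of p (assumed form of the arcsine transformation)."""
    return math.asin(math.sqrt(p))


def inverse_logit(z):
    """Map any real number back to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))


# Probabilities that follow an S-shaped curve in x (made-up intercept and slope).
for x in range(0, 13, 2):
    p = inverse_logit(-3 + 0.5 * x)
    # logit(p) recovers -3 + 0.5 * x, a straight line in x.
    # Note: arcsine_sqrt(p) always stays between 0 and pi / 2.
    print(f"x = {x:2d}  p = {p:.3f}  logit = {logit(p):6.2f}  "
          f"probit = {probit(p):6.2f}  arcsine = {arcsine_sqrt(p):.3f}")
```

The point of the loop is that the probabilities themselves bend into an S shape, while their logits increase by the same amount for every step in x.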

But as it turns out, you can’t just run the transformation then do a regular linear regression on the transformed data.  That would be way too easy, but it would also give inaccurate results (and the logit of an observed 0 or 1 isn’t even defined).  Logistic regression instead estimates the parameters by maximum likelihood, which gives better results: estimates with less bias and lower variance.
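
As a rough sketch of what a different estimation method can look like, the code below fits a one-predictor logistic regression by maximum likelihood using Newton-Raphson steps. The data are made up, the function name `fit_logistic` is hypothetical, and real statistical packages use more careful numerical routines, so treat this as an illustration under those assumptions only.

```python
import math

# Hypothetical data: predictor x and a binary outcome y (1 = success, 0 = failure).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]


def inverse_logit(z):
    """Map the linear predictor to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))


def fit_logistic(x, y, iterations=25, tolerance=1e-10):
    """Maximum likelihood fit of logit(p) = b0 + b1 * x by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [inverse_logit(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood with respect to b0 and b1.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        # Information matrix (negative Hessian), a symmetric 2x2 matrix.
        w = [pi * (1 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 linear system directly.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tolerance:
            break
    return b0, b1


intercept, slope = fit_logistic(x, y)
fitted = [inverse_logit(intercept + slope * xi) for xi in x]
for xi, pi in zip(x, fitted):
    print(f"x = {xi:2d}  estimated Pr(success) = {pi:.3f}")
```

Unlike the straight-line fit, every estimated probability here stays strictly between 0 and 1, and no transformation is ever applied to the observed 0/1 values themselves.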



Tagged With: arcsine transformation, binary variable, logistic regression, logit transformation, odds ratio


Reader Interactions

Comments

  1. Emily says

    December 2, 2020 at 7:31 pm

    If I have a binary outcome variable and several Independent variables (all categorical), I want to use a binary logistic regression. Do I still need to check for proportional odds? How do you do that using all categorical variables? Or is proportional odds just for ordinal regression?
    Thank you!!

    • Karen Grace-Martin says

      December 8, 2020 at 9:31 am

      Hi Emily,

      Proportional odds is just for ordinal regression. You can have categorical independent variables in an ordinal model and they are still subject to the proportional odds assumption.

  2. Natasha says

    October 29, 2020 at 1:13 am

    Is the use of logistic regression appropriate when you have a binary response variable AND binary predictor variables?

    • Karen Grace-Martin says

      November 2, 2020 at 10:04 am

      Hi Natasha,

      Yes, it’s appropriate. But if that’s your only predictor, it may also be overkill. See this article:
      https://www.theanalysisfactor.com/chi-square-test-vs-logistic-regression-is-a-fancier-test-better/

  3. Jeniece says

    April 25, 2017 at 11:52 am

    Hi Karen. I just came across this page as I have started my Mphil Biology and as I came in in January, I am thrown into the deep end stats wise. Thank you for this page and I look forward to checking out your webinars.

  4. Michal says

    September 4, 2016 at 2:21 am

    Sir, I ran a binary logistic regression for a study, but the variables in the model are not in the output results. What should I do?

  5. CY says

    June 14, 2016 at 10:02 pm

    Hi Karen, may I ask: if I use logistic regression with a set of predictors and a binary (1/0) outcome, the results I get are probabilities between 0 and 1, e.g. 0.2, 0.4, 0.7.
    Do those values represent the probability of success, that is, of the outcome being equal to 1?
    How can I analyse the results and decide which probability values count as 1, given that the results are never exactly 1 but fall somewhere between 0 and 1?

    Thanks

  6. Eric Cai says

    February 1, 2014 at 10:16 pm

    Hi Karen,

    No, actually – I do mean binary covariate.

    In the sample-size calculator for Cox regression in PASS 12, I wanted to determine the sample size required for detecting a hazard ratio of 2 for a binary covariate. There is an option to include Rsq for the relationship between this covariate and all of the other covariates in the model.

    (e.g. I have covariates X1, X2, X3. X1 is my binary covariate of interest. I want to determine the hazard ratio for X1. PASS has an option to enter the Rsq for a model with X1 as the response and X2 & X3 as the covariates.)

    I wanted to convince PASS that this would only make sense for a continuous covariate, but not a categorical one.

    • Karen says

      February 3, 2014 at 4:12 pm

      Got it. That makes sense.

      In that context, there does need to be some way to indicate the relationship among predictors in order to assess the power of X1’s effect on Y. I don’t have an answer as to what would be a better one, though, as it’s probably important to keep things relatively simple. Rsq would give you an approximation, if not something precise. I suppose the question is how much precision is needed there.

      Have you found that there is a big change in the sample size estimates if that Rsq is off?

  7. Eric Cai says

    January 23, 2014 at 12:34 pm

    Hi Karen,

    A loyal reader of my blog, Vi Ly, shared a beautiful example in R that has a strong relationship between a binary response and a set of covariates but a weak/moderate R-squared!

    http://linkd.in/1hOEHty

    Thanks for your time, and thanks again for this post and your great blog!

    Eric

    • Karen says

      January 23, 2014 at 1:12 pm

      Very nice.

      Yes, I’d agree, using a linear model and measuring Rsq for it will give you an approximate measure of fit. It may even be reasonable for ranking models (higher Rsq models will have better fit). But I would use caution.

      One way I think about it is to actually compute R, the correlation coefficient, between a binary 1/0 variable and a continuous variable. It *will* give you an approximation of the strength of the relationship, but it’s never going to be linear, as the r implies.

      One other thing, in that thread, you mention at one point that you’re trying to convince him not to use Rsq for a binary covariate. Was covariate a slip of the keyboard? It is the response, yes?

  8. Eric Cai says

    January 18, 2014 at 12:21 am

    Hi Karen,

    Thanks for this very informative blog post.

    I want to show that R-squared (regression sum of squares divided by total sum of squares) is not a good measure of the strength of the relationship between a binary response and a set of predictors. I argue that most of the fitted responses will be far away from the actual responses, even if there is a strong relationship between the predictors and the binary response. Someone disagreed with me on this, and I am seeking other ways to show that he is wrong.

    Here is one way: I want to justify my view by finding an example of a strong relationship between a binary response and a set of predictors that has a low R-squared. Can you think of such an example?

    Can you think of any other arguments for why R-squared is not a good measure of the fit between a binary response and a set of predictors?

    Thanks,

    Eric

    • Karen says

      January 20, 2014 at 11:02 am

      Hi Eric,

      Okay, first I have a few questions. It sounds like you’re running a linear regression as logistic doesn’t have a true R squared or sums of squares. Is that right?

      • Eric Cai says

        January 22, 2014 at 2:18 am

        Yes, I am considering linear regression for binary responses.

        I know that it doesn’t make sense. I’m trying to convince someone that the R-squared from linear regression for binary responses is not a suitable measure of fit, even if it can be computed.

        I now realize that it may be wrong to say that the fitted responses will be far away from 0 and 1. Nonetheless, I still think that R-squared may overestimate or underestimate the true strength of the association between a binary response and a set of predictors – I just don’t know how to show this.

  9. thom says

    September 27, 2013 at 12:46 am

    I want to predict the effect of socio-demographic factors such as age, gender, education, income, employment, and number of children on health insurance purchase. Which statistical model will best give me the desired results? Thank you.

    • Karen says

      September 30, 2013 at 3:28 pm

      Hi Thom,

      There are so many things to consider in deciding a model. Here are a few things I wrote on it:
      https://www.theanalysisfactor.com/8-things-to-consider-in-choosing-statistical-analysis/
      https://www.theanalysisfactor.com/statistical-analysis-planning-strategies/

  10. Baris says

    June 21, 2012 at 11:48 am

    Hi Karen,

    Thank you so much for your reply. Would you be kind enough to explain what you mean by “the value with the mean in the middle or at the end”? Thank you again.

    Best,

    • Karen says

      June 25, 2012 at 12:01 pm

      Hi Baris,

      Yes. What I meant is that you should put the means for each group in order, from the highest value to the lowest. Sometimes it makes the most sense to compare everything to the highest value (or lowest) or the central value.

      Karen

  11. Baris says

    June 19, 2012 at 5:35 pm

    Hi again,

    I think my question above is related to selecting the reference category. Are there any best practices for selecting which level should be the reference? Thank you.

    Best,

    • Karen says

      June 19, 2012 at 7:09 pm

      Hi Baris,

      It sounds like it is. There are different choices. Sometimes one category is a clear group to compare all others to, like the largest, or a control group. Other times it makes sense to just choose the value with the mean in the middle or at one end. It’s whatever helps you interpret the results.

      Karen

  12. Baris says

    June 18, 2012 at 4:20 pm

    Dear Karen,

    I’m trying a logistic regression model in a marketing context. The DV is whether an account “bought a product” or “did not buy the product”. I have several IVs, one of which is revenue range. When I run the logistic regression, the output table indicates that the “revenue range” variable is significant overall (p-value = .000), but none of the levels of the variable is significant. I was wondering if there’s an explanation for this behavior. Thank you so much for your help.
    Best,



Copyright © 2008–2021 The Analysis Factor, LLC. All rights reserved.