Why Logistic Regression for Binary Response?

by Karen Grace-Martin

Logistic regression models can seem pretty overwhelming to the uninitiated. Why not use a regular regression model? Just turn Y into an indicator variable–Y=1 for success and Y=0 for failure.

For some good reasons.

1. It doesn’t make sense to model Y as a linear function of the parameters because Y has only two values. You just can’t make a line out of that (at least not one that fits the data well).

2. The predicted values can be any positive or negative number, not just 0 or 1.

3. The values of 0 and 1 are arbitrary. The important part is not to predict the numerical value of Y, but the probability that success or failure occurs, and the extent to which that probability depends on the predictor variables.

So okay, you say. Why not use a simple transformation of Y, like the probability of success, that is, the probability that Y=1?

Well, that doesn’t work so well either.

Why not?

1. The right hand side of the equation can be any number, but the left hand side can only range from 0 to 1.

2. It turns out the relationship is not linear, but rather follows an S-shaped (or sigmoidal) curve.
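
As a rough illustration of that S shape, here is a small Python sketch using the logistic curve, which is one common sigmoidal function (choosing it here is my assumption; other S-shaped curves behave similarly). The function name `inverse_logit` and the input values are just for illustration.

```python
import math

def inverse_logit(eta):
    """Map a linear predictor eta (any real number) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Equal steps on the right hand side of the equation...
etas = [-4, -2, 0, 2, 4]
# ...give unequal steps in probability: small near 0 and 1, large in the middle.
probs = [inverse_logit(eta) for eta in etas]

for eta, p in zip(etas, probs):
    print(f"linear predictor = {eta:3d}  probability = {p:.3f}")
```

The probabilities never leave the 0 to 1 range no matter how large or small the linear predictor gets, and they change fastest around 0.5.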

To obtain a linear relationship, we need to transform this response, Pr(success), too.

As luck would have it, there are a few functions that:

1. are not restricted to values between 0 and 1

2. will form a linear relationship with our parameters

These functions include:

• Arcsine

• Probit

• Logit

All three of these work about equally well, but (believe it or not) the Logit function is the easiest to interpret, because its coefficients translate into odds ratios.
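
To make the three transformations concrete, here is a short Python sketch using only the standard library. The function names are mine, and I am assuming that "arcsine" means the usual arcsine square-root transformation; treat it as an illustration rather than a recipe.

```python
import math
from statistics import NormalDist

def logit(p):
    """Log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

def probit(p):
    """Inverse of the standard normal cumulative distribution function."""
    return NormalDist().inv_cdf(p)

def arcsine(p):
    """Arcsine square-root transformation (assumed form)."""
    return math.asin(math.sqrt(p))

# Apply each transformation to a few probabilities strictly between 0 and 1.
for p in (0.05, 0.25, 0.5, 0.75, 0.95):
    print(f"p = {p:.2f}  logit = {logit(p):7.3f}  "
          f"probit = {probit(p):7.3f}  arcsine = {arcsine(p):6.3f}")
```

The logit and probit stretch probabilities near 0 and 1 out toward minus and plus infinity. The arcsine square-root version, at least as written here, stays between 0 and pi/2, so it only loosens the 0 to 1 restriction rather than removing it.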

But as it turns out, you can’t just run the transformation and then do a regular linear regression on the transformed data. That would be way too easy, and it would also give inaccurate results. Logistic Regression uses a different method for estimating the parameters, maximum likelihood, which gives better results. Better here means approximately unbiased in large samples, with lower variances.
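
Here is a bare-bones sketch of what estimating by maximum likelihood, the standard method for logistic regression, can look like in Python with NumPy. It uses Newton-Raphson steps on the log-likelihood. The function name `fit_logistic`, the fixed number of iterations, and the toy data are all assumptions for illustration; real software adds convergence checks, standard errors, and handling for data where the estimates don't exist.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Maximum likelihood fit of a logistic regression by Newton-Raphson.

    X: (n, p) design matrix that already includes a column of ones.
    y: (n,) array of 0/1 outcomes.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))            # fitted probabilities
        gradient = X.T @ (y - p)                       # slope of the log-likelihood
        weights = p * (1.0 - p)                        # variance of each 0/1 outcome
        information = X.T @ (X * weights[:, None])     # negative second derivative
        beta = beta + np.linalg.solve(information, gradient)
    return beta

# Hypothetical toy data: the 0s and 1s overlap, so the estimates are finite.
x = np.arange(10, dtype=float)
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1], dtype=float)
X = np.column_stack([np.ones_like(x), x])

beta = fit_logistic(X, y)
p_hat = 1.0 / (1.0 + np.exp(-X @ beta))
print("intercept and slope on the logit scale:", beta)
print("fitted probabilities:", np.round(p_hat, 3))
```

Unlike the straight-line fit, every fitted value here is a probability between 0 and 1, and the fitted probabilities add up to the number of observed successes.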

Comments

td says

May 14, 2021 at 3:01 am

Well, I want to detect multicollinearity among the independent variables, which are categorical (gender, yes/no, low/high) and Likert scale (agree, disagree, neutral, and strongly agree). Our dependent variable is a binary response. So how can I test for multicollinearity and heteroscedasticity?

- Karen Grace-Martin says
  
  July 16, 2021 at 11:52 am
  
  Many statistical software packages don’t have multicollinearity diagnostics for logistic regression, which you need for the binary response. It’s fine to run the model in a linear regression JUST to get the multicollinearity diagnostics. After all, multicollinearity is about the predictors only, not the response.
  
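
For readers who want to try this, here is a hedged sketch of one common multicollinearity diagnostic, the variance inflation factor (VIF), computed in Python with NumPy. Picking the VIF is my assumption about which diagnostic is meant, and the function name `vif` and the dummy-coded predictors are made up. Note that the response never appears: each predictor is regressed on the other predictors only.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X.

    X: (n, k) array of predictors only (no intercept column, no response).
    """
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        # Regress predictor j on an intercept plus all the other predictors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef = np.linalg.lstsq(others, target, rcond=None)[0]
        resid = target - others @ coef
        # VIF = 1 / (1 - R squared) = total variance / residual variance.
        factors.append(target.var() / resid.var())
    return np.array(factors)

# Hypothetical dummy-coded (0/1) categorical predictors.
gender = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
smoker = np.array([0, 0, 1, 1, 0, 0, 1, 1], dtype=float)
high_income = np.array([0, 0, 0, 1, 1, 1, 1, 1], dtype=float)  # overlaps with gender

vifs = vif(np.column_stack([gender, smoker, high_income]))
print(dict(zip(["gender", "smoker", "high_income"], np.round(vifs, 2))))
```

In this made-up data, gender and high_income overlap heavily, so their VIFs come out well above that of smoker.
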
Emily says

December 2, 2020 at 7:31 pm

If I have a binary outcome variable and several Independent variables (all categorical), I want to use a binary logistic regression. Do I still need to check for proportional odds? How do you do that using all categorical variables? Or is proportional odds just for ordinal regression?
Thank you!!

- Karen Grace-Martin says
  
  December 8, 2020 at 9:31 am
  
  Hi Emily,
  
  Proportional odds is just for ordinal regression. You can have categorical independent variables in an ordinal model and they are still subject to the proportional odds assumption.
  
Natasha says

October 29, 2020 at 1:13 am

Is the use of logistic regression appropriate when you have a binary response variable AND binary predictor variables?

- Karen Grace-Martin says
  
  November 2, 2020 at 10:04 am
  
  Hi Natasha,
  
  Yes, it’s appropriate. But if that’s your only predictor, it may also be overkill. See this article:
  https://www.theanalysisfactor.com/chi-square-test-vs-logistic-regression-is-a-fancier-test-better/
  
Jeniece says

April 25, 2017 at 11:52 am

Hi Karen. I just came across this page as I have started my Mphil Biology and as I came in in January, I am thrown into the deep end stats wise. Thank you for this page and I look forward to checking out your webinars.

Michal says

September 4, 2016 at 2:21 am

Sir, I ran a binary logistic regression for a study, but the variables in the model are not in the output results. What should I do?

CY says

June 14, 2016 at 10:02 pm

Hi Karen, may I ask: if I use logistic regression with a set of predictors and a binary outcome (1/0), the results I get are probabilities between 0 and 1, e.g. 0.2, 0.4, 0.7.
Do those values represent the probability of success (Y equal to 1)?
How can I analyse the results? How do I decide which probability values count as 1, given that the results are never exactly 1 but fall somewhere between 0 and 1?

Thanks

Eric Cai says

February 1, 2014 at 10:16 pm

Hi Karen,

No, actually – I do mean binary covariate.

In the sample-size calculator for Cox regression in PASS 12, I wanted to determine the sample size required for detecting a hazard ratio of 2 for a binary covariate. There is an option to include Rsq for the relationship between this covariate and all of the other covariates in the model.

(e.g. I have covariates X1, X2, X3. X1 is my binary covariate of interest. I want to determine the hazard ratio for X1. PASS has an option to enter the Rsq for a model with X1 as the response and X2 & X3 as the covariates.)

I wanted to convince PASS that this would only make sense for a continuous covariate, but not a categorical one.

- Karen says
  
  February 3, 2014 at 4:12 pm
  
  Got it. That makes sense.
  
  In that context, there does need to be some way to indicate the relationship among predictors in order to assess the power of X1’s effect on Y. I don’t have an answer as to what would be a better one, though, as it’s probably important to keep things relatively simple. Rsq would give you an approximation, if not something precise. I suppose the question is how much precision is needed there.
  
  Have you found that there is a big change in the sample size estimates if that Rsq is off?
  
Eric Cai says

January 23, 2014 at 12:34 pm

Hi Karen,

A loyal reader of my blog, Vi Ly, shared a beautiful example in R that has a strong relationship between a binary response and a set of covariates but a weak/moderate R-squared!

http://linkd.in/1hOEHty

Thanks for your time, and thanks again for this post and your great blog!

Eric

- Karen says
  
  January 23, 2014 at 1:12 pm
  
  Very nice.
  
  Yes, I’d agree, using a linear model and measuring Rsq for it will give you an approximate measure of fit. It may even be reasonable for ranking models (higher Rsq models will have better fit). But I would use caution.
  
  One way I think about it is to actually compute R, the correlation coefficient, between a binary 1/0 variable and a continuous variable. It *will* give you an approximation of the strength of the relationship, but it’s never going to be linear, as the r implies.
  
  One other thing, in that thread, you mention at one point that you’re trying to convince him not to use Rsq for a binary covariate. Was covariate a slip of the keyboard? It is the response, yes?
  
Eric Cai says

January 18, 2014 at 12:21 am

Hi Karen,

Thanks for this very informative blog post.

I want to show that R-squared (regression sum of squares divided by total sum of squares) is not a good measure of the strength of the relationship between a binary response and a set of predictors. I argue that most of the fitted responses will be far away from the actual responses, even if there is a strong relationship between the predictors and the binary response. Someone disagreed with me on this, and I am seeking other ways to show that he is wrong.

Here is one way: I want to justify my view by finding an example of a strong relationship between a binary response and a set of predictors that has a low R-squared. Can you think of such an example?

Can you think of any other arguments for why R-squared is not a good measure of the fit between a binary response and a set of predictors?

Thanks,

Eric

- Karen says
  
  January 20, 2014 at 11:02 am
  
  Hi Eric,
  
  Okay, first I have a few questions. It sounds like you’re running a linear regression as logistic doesn’t have a true R squared or sums of squares. Is that right?
  
  - Eric Cai says
    
    January 22, 2014 at 2:18 am
    
    Yes, I am considering linear regression for binary responses.
    
    I know that it doesn’t make sense. I’m trying to convince someone that the R-squared from linear regression for binary responses is not a suitable measure of fit, even if it can be computed.
    
    I now realize that it may be wrong to say that the fitted responses will be far away from 0 and 1. Nonetheless, I still think that R-squared may overestimate or underestimate the true strength of the association between a binary response and a set of predictors – I just don’t know how to show this.
    
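
One way to build the kind of example asked for in this thread is sketched below in Python with NumPy. It is made-up data, not the example linked above: Y is completely determined by x, yet the R-squared from a linear regression is low because successes are rare and the relationship is a step, not a line.

```python
import numpy as np

# Hypothetical data: y is 1 exactly when x is 95 or more,
# so knowing x tells you y with certainty.
x = np.arange(100, dtype=float)
y = (x >= 95).astype(float)

# Linear regression of the 0/1 response on x, then the usual R-squared.
X = np.column_stack([np.ones_like(x), x])
coef = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ coef
total = y - y.mean()
r_squared = 1.0 - (resid @ resid) / (total @ total)

print(f"R-squared from the linear model: {r_squared:.3f}")
```

Here R-squared comes out at roughly 0.14 even though the relationship could hardly be stronger, which supports the point that it measures linear fit, not strength of association with a binary response.
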
thom says

September 27, 2013 at 12:46 am

I want to predict the effect of socio-demographic factors such as age, gender, education, income, employment, and number of children on health insurance purchase. Which statistical model will best give me the desired results? Thank you.

- Karen says
  
  September 30, 2013 at 3:28 pm
  
  Hi Thom,
  
  There are so many things to consider in deciding a model. Here are a few things I wrote on it:
  https://www.theanalysisfactor.com/8-things-to-consider-in-choosing-statistical-analysis/
  https://www.theanalysisfactor.com/statistical-analysis-planning-strategies/
  
Baris says

June 21, 2012 at 11:48 am

Hi Karen,

Thank you so much for your reply. Would you be kind enough to explain what you mean by “the value with the mean in the middle or at the end”? Thank you again.

Best,

- Karen says
  
  June 25, 2012 at 12:01 pm
  
  Hi Baris,
  
  Yes. What I meant is you should put the means for each group in order, from the highest value to the lowest. Sometimes it makes the most sense to compare everything to the highest value (or lowest) or the central value.
  
  Karen
  
Baris says

June 19, 2012 at 5:35 pm

Hi again,

I think my question above is related to selecting the reference category. Are there any best practices for selecting which level should be the reference? Thank you.

Best,

- Karen says
  
  June 19, 2012 at 7:09 pm
  
  Hi Baris,
  
  It sounds like it is. There are different choices. Sometimes one category is a clear group to compare all others to, like the largest, or a control group. Other times it makes sense to just choose the value with the mean in the middle or at one end. It’s whatever helps you interpret the results.
  
  Karen
  
Baris says

June 18, 2012 at 4:20 pm

Dear Karen,

I’m trying a logistic regression model in a marketing context. The DV is whether the account “bought a product” or “did not buy the product”. I have several IVs, one of which is revenue range. When I run the logistic regression, the output table indicates that the “revenue range” variable is significant overall (p-value = .000), but none of the levels of the variable is significant. I was wondering if there’s an explanation for this behavior. Thank you so much for your help.
Best,
