Logistic regression models can seem pretty overwhelming to the uninitiated. Why not use a regular regression model? Just turn Y into an indicator variable: Y=1 for success and Y=0 for failure.
For some good reasons.
1. It doesn’t make sense to model Y as a linear function of the parameters because Y has only two values. You just can’t make a line out of that (at least not one that fits the data well).
2. The predicted values can be any positive or negative number, not just 0 or 1.
3. The values of 0 and 1 are arbitrary. The important part is not to predict the numerical value of Y, but the probability that success or failure occurs, and the extent to which that probability depends on the predictor variables.
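A quick simulation makes point 2 concrete. This is an illustrative sketch only (the data and coefficients are invented for the example): fit ordinary least squares to a 0/1 outcome and the predicted values stray below 0 and above 1.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=500)
# a 0/1 outcome whose probability really does depend on x
y = (rng.random(500) < 1 / (1 + np.exp(-(0.5 + 2 * x)))).astype(float)

# ordinary least squares of the 0/1 outcome on x
A = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(A, y, rcond=None)
yhat = A @ b
print(yhat.min(), yhat.max())  # predictions fall below 0 and above 1
```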
So okay, you say. Why not model a simple transformation of Y, like the probability of success: the probability that Y=1?
Well, that doesn’t work so well either.
1. The right-hand side of the equation can be any number, but the left-hand side can only range from 0 to 1.
2. It turns out the relationship is not linear, but rather follows an S-shaped (or sigmoidal) curve.
To obtain a linear relationship, we need to transform this response, Pr(success), as well.
As luck would have it, there are a few functions that:
1. are not restricted to values between 0 and 1
2. will form a linear relationship with our parameters
These functions include the logit, the probit, and the complementary log-log.
All three of these work just as well, but (believe it or not) the Logit function is the easiest to interpret.
But as it turns out, you can’t just run the transformation and then do a regular linear regression on the transformed data. That would be way too easy, but it would also give inaccurate results. Logistic regression uses a different method for estimating the parameters, maximum likelihood, which gives better results: better meaning unbiased, with lower variances.
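The pieces above can be sketched numerically. This is a minimal illustration, not production code, and the simulated coefficients (-1 and 2) are invented for the example: it defines the logit and its inverse, then estimates the parameters by maximizing the log-likelihood with plain gradient ascent, which is the spirit of what logistic regression software does (real packages use faster algorithms such as iteratively reweighted least squares).

```python
import numpy as np

def sigmoid(z):
    # inverse logit: squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    # log odds: maps a probability in (0, 1) onto the whole real line
    return np.log(p / (1.0 - p))

# simulate from a true logistic model: Pr(Y=1 | x) = sigmoid(-1 + 2x)
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = rng.binomial(1, sigmoid(-1.0 + 2.0 * x))

# maximum likelihood by gradient ascent on the average log-likelihood
X = np.column_stack([np.ones_like(x), x])   # intercept and slope
beta = np.zeros(2)
for _ in range(5000):
    p = sigmoid(X @ beta)
    beta += 0.5 * X.T @ (y - p) / len(y)    # gradient of the log-likelihood

print(beta)  # close to the true values (-1, 2)
```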
I want to detect multicollinearity among independent variables that are categorical: gender, yes/no, low/high, and a Likert scale (agree, disagree, neutral, and strongly agree). Our dependent variable is a binary response. So how can I test for multicollinearity and heteroscedasticity?
Karen Grace-Martin says
Many statistical software packages don’t have multicollinearity diagnostics for logistic regression, which is what you need for a binary response. It’s fine to run the model as a linear regression JUST to get the multicollinearity diagnostics. After all, multicollinearity is about the predictors only, not the response.
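As a sketch of that idea (variable names invented for the example): variance inflation factors can be computed directly from linear regressions of each predictor on the others. No response variable appears anywhere in the calculation, which is exactly why the linear-model diagnostics carry over.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (predictors only).

    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on all the other columns. The binary response never
    enters this calculation.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        yj = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, yj, rcond=None)
        r2 = 1 - (yj - A @ coef).var() / yj.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# two nearly collinear dummy variables inflate each other's VIF
rng = np.random.default_rng(1)
x1 = rng.binomial(1, 0.5, size=200).astype(float)
x2 = x1.copy()
flip = rng.random(200) < 0.02
x2[flip] = 1.0 - x2[flip]          # x2 agrees with x1 about 98% of the time
x3 = rng.normal(size=200)

print(vif(np.column_stack([x1, x2, x3])))  # first two VIFs are large; the third is near 1
```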
If I have a binary outcome variable and several Independent variables (all categorical), I want to use a binary logistic regression. Do I still need to check for proportional odds? How do you do that using all categorical variables? Or is proportional odds just for ordinal regression?
Karen Grace-Martin says
Proportional odds is just for ordinal regression. You can have categorical independent variables in an ordinal model and they are still subject to the proportional odds assumption.
Is the use of logistic regression appropriate when you have a binary response variable AND binary predictor variables?
Karen Grace-Martin says
Yes, it’s appropriate. But if that’s your only predictor, it may also be overkill. See this article:
Hi Karen. I just came across this page as I have started my Mphil Biology and as I came in in January, I am thrown into the deep end stats wise. Thank you for this page and I look forward to checking out your webinars.
Sir, I ran a binary logistic regression for a study, but the variables in the model do not appear in the output results. What should I do?
Hi Karen, may I ask: if I use logistic regression with a set of predictors and a set of binary data (1/0), the results I get are probabilities between 0 and 1, e.g. 0.2, 0.4, 0.7.
Do those values represent the probability of success, i.e. of being equal to 1?
How can I analyze the results? How do I decide what value of probability counts as a 1, given that the predictions are never exactly one but somewhere between 0 and 1?
Eric Cai says
No, actually – I do mean binary covariate.
In the sample-size calculator for Cox regression in PASS 12, I wanted to determine the sample size required for detecting a hazard ratio of 2 for a binary covariate. There is an option to include Rsq for the relationship between this covariate and all of the other covariates in the model.
(e.g. I have covariates X1, X2, X3. X1 is my binary covariate of interest. I want to determine the hazard ratio for X1. PASS has an option to enter the Rsq for a model with X1 as the response and X2 & X3 as the covariates.)
I wanted to convince PASS that this would only make sense for a continuous covariate, but not a categorical one.
Got it. That makes sense.
In that context, there does need to be some way to indicate the relationship among predictors in order to assess the power of X1’s effect on Y. I don’t have an answer as to what would be a better one, though, as it’s probably important to keep things relatively simple. Rsq would give you an approximation, if not something precise. I suppose the question is how much precision is needed there.
Have you found that there is a big change in the sample size estimates if that Rsq is off?
Eric Cai says
A loyal reader of my blog, Vi Ly, shared a beautiful example in R that has a strong relationship between a binary response and a set of covariates but a weak/moderate R-squared!
Thanks for your time, and thanks again for this post and your great blog!
Yes, I’d agree, using a linear model and measuring Rsq for it will give you an approximate measure of fit. It may even be reasonable for ranking models (higher Rsq models will have better fit). But I would use caution.
One way I think about it is to actually compute R, the correlation coefficient, between a binary 1/0 variable and a continuous variable. It *will* give you an approximation of the strength of the relationship, but it’s never going to be linear, as the r implies.
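That cap on r is easy to demonstrate with a simulation (illustrative numbers only): even when the binary variable is a deterministic function of the continuous one, the Pearson correlation tops out well below 1.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100_000)
y = (x > 0).astype(float)   # y is *completely* determined by x

r = np.corrcoef(x, y)[0, 1]
print(round(r, 2))          # about 0.80, i.e. sqrt(2/pi), never 1
```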
One other thing, in that thread, you mention at one point that you’re trying to convince him not to use Rsq for a binary covariate. Was covariate a slip of the keyboard? It is the response, yes?
Eric Cai says
Thanks for this very informative blog post.
I want to show that R-squared (regression sum of squares divided by total sum of squares) is not a good measure of the strength of the relationship between a binary response and a set of predictors. I argue that most of the fitted responses will be far away from the actual responses, even if there is a strong relationship between the predictors and the binary response. Someone disagreed with me on this, and I am seeking other ways to show that he is wrong.
Here is one way: I want to justify my view by finding an example of a strong relationship between a binary response and a set of predictors that has a low R-squared. Can you think of such an example?
Can you think of any other arguments for why R-squared is not a good measure of the fit between a binary response and a set of predictors?
Okay, first I have a few questions. It sounds like you’re running a linear regression, since logistic doesn’t have a true R-squared or sums of squares. Is that right?
Eric Cai says
Yes, I am considering linear regression for binary responses.
I know that it doesn’t make sense. I’m trying to convince someone that the R-squared from linear regression for binary responses is not a suitable measure of fit, even if it can be computed.
I now realize that it may be wrong to say that the fitted responses will be far away from 0 and 1. Nonetheless, I still think that R-squared may overestimate or underestimate the true strength of the association between a binary response and a set of predictors – I just don’t know how to show this.
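One concrete construction along those lines (a simulation with invented numbers): make the binary response a rare event that is *perfectly* determined by a predictor. The relationship could not be stronger, yet the linear-regression R-squared stays small.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=200_000)
y = (x > 2).astype(float)   # perfectly determined by x, but a rare event (~2% of cases)

# R-squared from a simple linear regression of y on x is just r^2
r2 = np.corrcoef(x, y)[0, 1] ** 2
print(round(r2, 2))         # roughly 0.13: a tiny R^2 despite a deterministic relationship
```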
I want to predict health insurance purchase from socio-demographic factors such as age, gender, education, income, employment, and number of children. What statistical model will best give me the desired results? Thank you.
There are so many things to consider in deciding a model. Here are a few things I wrote on it:
Thank you so much for your reply. Would you be so kind as to explain what you mean by “the value with the mean in the middle or at the end”? Thank you again.
Yes. What I meant is you should list the means for each group in order, starting with the highest value and going to the lowest. Sometimes it makes the most sense to compare everything to the highest value (or the lowest), or to the central value.
I think my question above is related to selecting the reference category. Are there any best practices for selecting which level should be the reference? Thank you.
It sounds like it is. There are different choices. Sometimes one category is a clear group to compare all others to, like the largest, or a control group. Other times it makes sense to just choose the value with the mean in the middle or at one end. It’s whatever helps you interpret the results.
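One thing worth knowing when choosing: the reference category changes which comparisons the coefficients report, not the model itself. A small sketch (simulated data, invented group means) with a three-level predictor shows the fitted values are identical whichever level is the reference.

```python
import numpy as np

rng = np.random.default_rng(4)
g = rng.integers(0, 3, size=300)   # a categorical predictor with levels 0, 1, 2
y = np.array([0.2, 0.5, 0.9])[g] + rng.normal(scale=0.1, size=300)

def fitted(ref):
    # dummy-code the predictor with `ref` as the reference category, then fit OLS
    levels = [lev for lev in (0, 1, 2) if lev != ref]
    X = np.column_stack([np.ones(len(g))] + [(g == lev).astype(float) for lev in levels])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ b

# same predictions regardless of reference level; only the coefficients' meaning changes
print(np.allclose(fitted(0), fitted(2)))  # True
```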
I’m trying a logistic regression model in a marketing context. The DV is whether an account “bought a product” or “did not buy the product”. I have several IVs, one of which is revenue range. When I run the logistic regression, the output table indicates the “revenue range” variable is significant overall (p-value = .000), but none of the levels of the variable is significant. I was wondering if there’s an explanation for this behavior. Thank you so much for your help.