Logistic regression models can seem pretty overwhelming to the uninitiated. Why not use a regular regression model? Just turn Y into an indicator variable: Y=1 for success and Y=0 for failure.

For some good reasons.

1. It doesn’t make sense to model Y as a linear function of the parameters because Y has only two values. You just can’t make a line out of that (at least not one that fits the data well).

2. The predicted values from a linear model can be any positive or negative number, not just 0 or 1.

3. The values of 0 and 1 are arbitrary. The important part is not to predict the numerical value of Y, but the probability that success or failure occurs, and the extent to which that probability depends on the predictor variables.
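To make the second point concrete, here is a minimal sketch in plain Python. The data are made up for illustration (they are not from any real study), and `fit_line` is a hypothetical helper name. It fits an ordinary least-squares line to a 0/1 outcome and prints predictions, some of which fall below 0 or above 1.

```python
# Hypothetical toy data: x is a single predictor, y is a 0/1 outcome.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

intercept, slope = fit_line(x, y)

# Predict at a few x values, including some beyond the observed range.
new_x = [-2, 1, 5.5, 10, 14]
predictions = [intercept + slope * v for v in new_x]
for v, p in zip(new_x, predictions):
    print(f"x = {v:>4}: predicted Y = {p:.2f}")
```

With these numbers, even the predictions at the smallest and largest observed x values land outside the 0 to 1 range, so they cannot be read as probabilities.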

So okay, you say. Why not use a simple transformation of Y, like the probability of success, that is, the probability that Y=1?

Well, that doesn’t work so well either.

Why not?

1. The right hand side of the equation can be any number, but the left hand side can only range from 0 to 1.

2. It turns out the relationship is not linear, but rather follows an S-shaped (or sigmoidal) curve.

To obtain a linear relationship, we need to transform this response, Pr(success), as well.

As luck would have it, there are a few functions that:

1. are not restricted to values between 0 and 1

2. will form a linear relationship with our parameters

These functions include:

•Arcsine

•Probit

•Logit

All three of these work just as well, but (believe it or not) the Logit function is the easiest to interpret.
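As a rough illustration of why the logit is convenient, here is a small sketch in plain Python. The function names `logit` and `inverse_logit` are my own labels, not from any particular package. The logit is the log of the odds, p / (1 - p), so it stretches probabilities between 0 and 1 over the whole number line, and its inverse brings any number back into the 0 to 1 range.

```python
import math

def logit(p):
    """Log-odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1 - p))

def inverse_logit(z):
    """Map any real number z back to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))

for p in [0.01, 0.25, 0.5, 0.75, 0.99]:
    z = logit(p)
    print(f"p = {p:.2f} -> logit = {z:+.3f} -> back to p = {inverse_logit(z):.2f}")
```

A probability of 0.5 maps to a logit of 0, probabilities below 0.5 map to negative values, and probabilities above 0.5 map to positive values.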

But as it turns out, you can’t just run the transformation and then do a regular linear regression on the transformed data. That would be way too easy, and it would also give inaccurate results. Logistic regression uses a different method for estimating the parameters, maximum likelihood, which gives better results (better meaning unbiased, with lower variances).
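One reason the shortcut fails: with individual 0/1 outcomes, the logit of Y is not even defined, because it would require the log of 0. Logistic regression is standardly fit by maximum likelihood instead. Below is a minimal sketch of that idea in plain Python, using Newton-Raphson steps on made-up toy data. The data and the names `prob` and `fit_logistic` are assumptions for illustration only; in practice you would rely on your statistical software's logistic regression procedure rather than code like this.

```python
import math

# Hypothetical toy data (not from the post): one predictor x, a 0/1 outcome y.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

def prob(b0, b1, xi):
    """Inverse logit of the linear predictor b0 + b1 * xi."""
    return 1 / (1 + math.exp(-(b0 + b1 * xi)))

def fit_logistic(x, y, max_iter=50, tol=1e-10):
    """Maximize the Bernoulli log-likelihood with Newton-Raphson steps."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        p = [prob(b0, b1, xi) for xi in x]
        # Gradient of the log-likelihood (the score).
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, x))
        # Information matrix (negative Hessian), with weights w = p * (1 - p).
        w = [pi * (1 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 linear system for the Newton step.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) < tol and abs(d1) < tol:
            break
    return b0, b1

b0, b1 = fit_logistic(x, y)
fitted = [prob(b0, b1, xi) for xi in x]
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
print("fitted probabilities:", [round(p, 3) for p in fitted])
```

Unlike a straight line fitted to the same data, every fitted value here is a probability strictly between 0 and 1, and the coefficients are on the logit (log-odds) scale.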

Jeniece says

Hi Karen. I just came across this page as I have started my MPhil in Biology, and as I came in in January, I have been thrown into the deep end stats-wise. Thank you for this page and I look forward to checking out your webinars.

Michal says

Sir, I ran a binary logistic regression for a study, but the variables in the model are not in the output results. What should I do?

CY says

Hi Karen, may I ask: if I use logistic regression with a set of predictors and a binary response (1/0), the results I get are probabilities between 0 and 1, e.g. 0.2, 0.4, 0.7.

Do those values represent the probability of success, that is, of the response being equal to 1?

How can I analyse the results? How do I decide what probability value counts as 1, given that the results will not be exactly 1 but somewhere between 0 and 1?

Thanks

Eric Cai says

Hi Karen,

No, actually – I do mean binary covariate.

In the sample-size calculator for Cox regression in PASS 12, I wanted to determine the sample size required for detecting a hazard ratio of 2 for a binary covariate. There is an option to include Rsq for the relationship between this covariate and all of the other covariates in the model.

(e.g. I have covariates X1, X2, X3. X1 is my binary covariate of interest. I want to determine the hazard ratio for X1. PASS has an option to enter the Rsq for a model with X1 as the response and X2 & X3 as the covariates.)

I wanted to convince PASS that this would only make sense for a continuous covariate, but not a categorical one.

Karen says

Got it. That makes sense.

In that context, there does need to be some way to indicate the relationship among predictors in order to assess the power of X1’s effect on Y. I don’t have an answer as to what would be a better one, though, as it’s probably important to keep things relatively simple. Rsq would give you an approximation, if not something precise. I suppose the question is how much precision is needed there.

Have you found that there is a big change in the sample size estimates if that Rsq is off?

Eric Cai says

Hi Karen,

A loyal reader of my blog, Vi Ly, shared a beautiful example in R that has a strong relationship between a binary response and a set of covariates but a weak/moderate R-squared!

http://linkd.in/1hOEHty

Thanks for your time, and thanks again for this post and your great blog!

Eric

Karen says

Very nice.

Yes, I’d agree, using a linear model and measuring Rsq for it will give you an approximate measure of fit. It may even be reasonable for ranking models (higher Rsq models will have better fit). But I would use caution.

One way I think about it is to actually compute R, the correlation coefficient, between a binary 1/0 variable and a continuous variable. It *will* give you an approximation of the strength of the relationship, but it’s never going to be linear, as the r implies.

One other thing, in that thread, you mention at one point that you’re trying to convince him not to use Rsq for a binary covariate. Was covariate a slip of the keyboard? It is the response, yes?

Eric Cai says

Hi Karen,

Thanks for this very informative blog post.

I want to show that R-squared (regression sum of squares divided by total sum of squares) is not a good measure of the strength of the relationship between a binary response and a set of predictors. I argue that most of the fitted responses will be far away from the actual responses, even if there is a strong relationship between the predictors and the binary response. Someone disagreed with me on this, and I am seeking other ways to show that he is wrong.

Here is one way: I want to justify my view by finding an example of a strong relationship between a binary response and a set of predictors that has a low R-squared. Can you think of such an example?

Can you think of any other arguments for why R-squared is not a good measure of the fit between a binary response and a set of predictors?

Thanks,

Eric

Karen says

Hi Eric,

Okay, first I have a few questions. It sounds like you’re running a linear regression, as logistic regression doesn’t have a true R squared or sums of squares. Is that right?

Eric Cai says

Yes, I am considering linear regression for binary responses.

I know that it doesn’t make sense. I’m trying to convince someone that the R-squared from linear regression for binary responses is not a suitable measure of fit, even if it can be computed.

I now realize that it may be wrong to say that the fitted responses will be far away from 0 and 1. Nonetheless, I still think that R-squared may overestimate or underestimate the true strength of the association between a binary response and a set of predictors – I just don’t know how to show this.

thom says

I want to predict the effect of socio-demographic factors such as age, gender, education, income, employment, and number of children on health insurance purchase. Which statistical model will best give me the desired results? Thank you.

Karen says

Hi Thom,

There are so many things to consider in deciding a model. Here are a few things I wrote on it:

https://www.theanalysisfactor.com/8-things-to-consider-in-choosing-statistical-analysis/

https://www.theanalysisfactor.com/statistical-analysis-planning-strategies/

Baris says

Hi Karen,

Thank you so much for your reply. Would you be so kind as to explain what you mean by “the value with the mean in the middle or at the end”? Thank you again.

Best,

Karen says

Hi Baris,

Yes. What I meant is you should put the means for each group in order, from the highest value to the lowest. Sometimes it makes the most sense to compare everything to the highest value (or lowest) or the central value.

Karen

Baris says

Hi again,

I think my question above is related to selecting the reference category. Are there any best practices for selecting which level should be the reference? Thank you.

Best,

Karen says

Hi Baris,

It sounds like it is. There are different choices. Sometimes one category is a clear group to compare all others to, like the largest, or a control group. Other times it makes sense to just choose the value with the mean in the middle or at one end. It’s whatever helps you interpret the results.

Karen

Baris says

Dear Karen,

I’m trying a logistic regression model in a marketing context. The DV is whether an account “bought a product” or “did not buy the product”. I have several IVs, one of which is revenue range. When I run the logistic regression, the output table indicates that the “revenue range” variable is significant overall (p-value = .000), but none of the levels of the variable is significant. I was wondering if there’s an explanation for this behavior. Thank you so much for your help.

Best,