Logistic regression models can seem pretty overwhelming to the uninitiated. Why not use a regular regression model? Just turn Y into an indicator variable: Y=1 for success and Y=0 for failure.
For some good reasons.
1. It doesn’t make sense to model Y as a linear function of the parameters because Y has only two values. You just can’t make a line out of that (at least not one that fits the data well).
2. The predicted values can be any positive or negative number, not just 0 or 1.
3. The values of 0 and 1 are arbitrary. The important part is not to predict the numerical value of Y, but the probability that success or failure occurs, and the extent to which that probability depends on the predictor variables.
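To see the second point in action, here is a minimal sketch that fits an ordinary least-squares line to a 0/1 outcome. The ten data points are made up purely for illustration, and `fit_line` is a hypothetical helper, not part of any library.

```python
# Hypothetical toy data: one predictor x and a 0/1 outcome y.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]


def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


intercept, slope = fit_line(xs, ys)
fitted = [intercept + slope * x for x in xs]

# Flag every fitted value that could not possibly be a probability.
for x, f in zip(xs, fitted):
    flag = "" if 0.0 <= f <= 1.0 else "  <- outside [0, 1]"
    print(f"x={x:2d}  fitted={f:6.3f}{flag}")
```

With this toy data the fitted values at both ends of the x range fall outside the 0 to 1 interval, even though every observed Y is exactly 0 or 1.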
So okay, you say. Why not use a simple transformation of Y, like the probability of success, that is, the probability that Y=1?
Well, that doesn’t work so well either.
Why not?
1. The right-hand side of the equation can be any number, but the left-hand side can only range from 0 to 1.
2. It turns out the relationship is not linear, but rather follows an S-shaped (or sigmoidal) curve.
To obtain a linear relationship, we need to transform this response, Pr(success), as well.
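A quick way to see the S-shape is to compute Pr(success) over a range of predictor values. This is only a sketch: the coefficients `b0` and `b1` are hypothetical values chosen so the curve is centered at x = 5.

```python
import math


def sigmoid(z):
    """Logistic (inverse logit) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, chosen only for illustration.
b0, b1 = -5.0, 1.0

xs = list(range(0, 11))
probs = [sigmoid(b0 + b1 * x) for x in xs]

# The same one-unit step in x moves the probability by very different amounts.
steps = [probs[i + 1] - probs[i] for i in range(len(probs) - 1)]

for x, p in zip(xs, probs):
    print(f"x={x:2d}  Pr(success)={p:.3f}")
print("largest one-unit step: ", round(max(steps), 3))
print("smallest one-unit step:", round(min(steps), 3))
```

Every probability stays strictly between 0 and 1, and a one-unit change in x matters a lot near the middle of the curve but hardly at all in the tails. That is exactly what a straight line cannot do.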
As luck would have it, there are a few functions that:
1. are not restricted to values between 0 and 1
2. will form a linear relationship with our parameters
These functions include:
• Arcsine
• Probit
• Logit
All three work about equally well, but (believe it or not) the logit function is the easiest to interpret, because its coefficients translate directly into odds ratios.
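For a feel of what these three transformations do to a probability, here is a small sketch using only the Python standard library. It assumes "arcsine" means the usual arcsine square-root transformation, and the coefficient `b1` at the end is a hypothetical value, not an estimate from any data.

```python
import math
from statistics import NormalDist


def logit(p):
    """Log odds: ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))


def probit(p):
    """Inverse of the standard normal cumulative distribution function."""
    return NormalDist().inv_cdf(p)


def arcsine(p):
    """Arcsine square-root transformation (assumed form of 'arcsine')."""
    return math.asin(math.sqrt(p))


for p in [0.05, 0.25, 0.50, 0.75, 0.95]:
    print(
        f"p={p:.2f}  logit={logit(p):7.3f}  "
        f"probit={probit(p):7.3f}  arcsine={arcsine(p):6.3f}"
    )

# Why the logit is easy to read: a coefficient on the logit scale is a log
# odds ratio, so exponentiating it gives the multiplicative change in the odds.
b1 = 0.7  # hypothetical coefficient
print("odds multiply by", round(math.exp(b1), 3), "per one-unit increase in x")
```

The logit and probit values are negative below p = 0.5 and positive above it, so neither is confined to the 0 to 1 interval.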
But as it turns out, you can’t just apply the transformation and then run a regular linear regression on the transformed data. That would be way too easy, and it would also give inaccurate results. Logistic regression instead estimates the parameters by maximum likelihood, which gives better results: better meaning less biased, with lower variances.
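One way to see the problem with transforming first: the logit of an observed 0 or 1 is not even finite, so there is nothing to regress on. Maximum likelihood works with the 0/1 outcomes directly. The sketch below fits a one-predictor model by Newton-Raphson on made-up data. The data and the function names are hypothetical, and real software adds safeguards (for example, against perfectly separated data) that this version leaves out.

```python
import math

# Hypothetical toy data: one predictor x and a 0/1 outcome y.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(b0, b1):
    """Sum of log Pr(observed y) under the model logit(p) = b0 + b1 * x."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


def score(b0, b1):
    """Gradient of the log-likelihood with respect to (b0, b1)."""
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        g0 += y - p
        g1 += (y - p) * x
    return g0, g1


def fit_logistic(max_iter=50, tol=1e-10):
    """Newton-Raphson: repeatedly solve (information matrix) * step = score."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0, g1 = score(b0, b1)
        h00 = h01 = h11 = 0.0
        for x in xs:
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1


b0_hat, b1_hat = fit_logistic()
print("intercept:", round(b0_hat, 3), " slope:", round(b1_hat, 3))
print("log-likelihood at the fit: ", round(log_likelihood(b0_hat, b1_hat), 3))
print("log-likelihood at (0, 0):  ", round(log_likelihood(0.0, 0.0), 3))
```

The fitted coefficients are the values at which the gradient of the log-likelihood is zero, which is what the loop is searching for.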
If you want to learn all the ins and outs of dealing with logistic regression, check out our 8-hour live workshop Binary, Ordinal, and Multinomial Logistic Regression.
{ 5 comments }
Dear Karen,
I’m trying a logistic regression model in a marketing context. The DV is whether an account “bought a product” or “did not buy the product”. I have several IVs, one of which is revenue range. When I run the logistic regression, the output table indicates that the “revenue range” variable is significant overall (p-value = .000), but none of the levels of the variable is significant. I was wondering if there’s an explanation for this behavior. Thank you so much for your help.
Best,
Hi again,
I think my question above is related to selecting the reference category. Are there any best practices for selecting which level should be the reference? Thank you.
Best,
Hi Baris,
It sounds like it is. There are different choices. Sometimes one category is a clear group to compare all others to, like the largest, or a control group. Other times it makes sense to just choose the value with the mean in the middle or at one end. It’s whatever helps you interpret the results.
Karen
Hi Karen,
Thank you so much for your reply. Would you be so kind as to explain what you mean by “the value with the mean in the middle or at the end”? Thank you again.
Best,
Hi Baris,
Yes. What I meant is that you should put the means for each group in order, from the highest value to the lowest. Sometimes it makes the most sense to compare everything to the highest value (or the lowest), or to the central value.
Karen
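To make the reference-category point concrete, here is a small sketch with made-up counts for four revenue ranges. It relies on a simplifying assumption: in a model whose only predictor is one categorical variable, each level's coefficient is the log odds for that level minus the log odds for the reference level. The group labels, the counts, and the `coefficients` helper are all hypothetical.

```python
import math

# Hypothetical counts per revenue range: (number who bought, number of accounts).
groups = {
    "under 1M": (12, 100),
    "1M to 10M": (30, 100),
    "10M to 50M": (45, 100),
    "over 50M": (20, 100),
}


def log_odds(bought, total):
    p = bought / total
    return math.log(p / (1.0 - p))


# Put the groups in order of their mean purchase rate, highest first.
ordered = sorted(groups, key=lambda g: groups[g][0] / groups[g][1], reverse=True)


def coefficients(reference):
    """Log odds of each level minus the log odds of the reference level."""
    ref = log_odds(*groups[reference])
    return {g: log_odds(*groups[g]) - ref for g in ordered}


# Try the highest, a central, and the lowest group as the reference.
for reference in (ordered[0], ordered[len(ordered) // 2], ordered[-1]):
    print("reference =", reference)
    for g, b in coefficients(reference).items():
        print(f"  {g:11s} {b:+.3f}")
```

Each coefficient is a comparison with the reference group, so changing the reference changes which differences the output table shows, and how large they are, even though the model fits the data equally well.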