When your dependent variable is not continuous, unbounded, and measured on an interval or ratio scale, linear models don’t fit. The data just will not meet the assumptions of linear models. But there’s good news, other models exist for many types of dependent variables.
Today I’m going to go into more detail about 6 common types of dependent variables that are either discrete, bounded, or measured on a nominal or ordinal scale and the tests that work for them instead. Some are all of these.
Logistic regression models can seem pretty overwhelming to the uninitiated. Why not use a regular regression model? Just turn Y into an indicator variable–Y=1 for success and Y=0 for failure.
For some good reasons.
1.It doesn’t make sense to model Y as a linear function of the parameters because Y has only two values. You just can’t make a line out of that (at least not one that fits the data well).
2. The predicted values can be any positive or negative number, not just 0 or 1.
3. The values of 0 and 1 are arbitrary.The important part is not to predict the numerical value of Y, but the probability that success or failure occurs, and the extent to which that probability depends on the predictor variables.
So okay, you say. Why not use a simple transformation of Y, like probability of success–the probability that Y=1.
Well, that doesn’t work so well either.
1. The right hand side of the equation can be any number, but the left hand side can only range from 0 to 1.
2. It turns out the relationship is not linear, but rather follows an S-shaped (or sigmoidal) curve.
To obtain a linear relationship, we need to transform this response too, Pr(success).
As luck would have it, there are a few functions that:
1. are not restricted to values between 0 and 1
2. will form a linear relationship with our parameters
These functions include:
All three of these work just as well, but (believe it or not) the Logit function is the easiest to interpret.
But as it turns out, you can’t just run the transformation then do a regular linear regression on the transformed data. That would be way too easy, but also give inaccurate results. Logistic Regression uses a different method for estimating the parameters, which gives better results–better meaning unbiased, with lower variances.
Researchers are often interested in setting up a model to analyze the relationship between some predictors (i.e., independent variables) and a response (i.e., dependent variable). Linear regression is commonly used when the response variable is continuous. One assumption of linear models is that the residual errors follow a normal distribution. This assumption fails when the response variable is categorical, so an ordinary linear model is not appropriate. This article presents a regression model for a response variable that is dichotomous–having two categories. Examples are common: whether a plant lives or dies, whether a survey respondent agrees or disagrees with a statement, or whether an at-risk child graduates or drops out from high school.
In ordinary linear regression, the response variable (Y) is a linear function of the coefficients (B0, B1, etc.) that correspond to the predictor variables (X1, X2, etc.). A typical model would look like:
Y = B0 + B1*X1 + B2*X2 + B3*X3 + … + E
For a dichotomous response variable, we could set up a similar linear model to predict individuals’ category memberships if numerical values are used to represent the two categories. Arbitrary values of 1 and 0 are chosen for mathematical convenience. Using the first example, we would assign Y = 1 if a plant lives and Y = 0 if a plant dies.
This linear model does not work well for a few reasons. First, the response values, 0 and 1, are arbitrary, so modeling the actual values of Y is not exactly of interest. Second, it is really the probability that each individual in the population responds with 0 or 1 that we are interested in modeling. For example, we may find that plants with a high level of a fungal infection (X1) fall into the category “the plant lives” (Y) less often than those plants with low level of infection. Thus, as the level of infection rises, the probability of a plant living decreases.
Thus, we might consider modeling P, the probability, as the response variable. Again, there are problems. Although the general decrease in probability is accompanied by a general increase in infection level, we know that P, like all probabilities, can only fall within the boundaries of 0 and 1. Consequently, it is better to assume that the relationship between X1 and P is sigmoidal (S-shaped), rather than a straight line.
It is possible, however, to find a linear relationship between X1 and a function of P. Although a number of functions work, one of the most useful is the logit function. It is the natural log of the odds that Y is equal to 1, which is simply the ratio of the probability that Y is 1 divided by the probability that Y is 0. The relationship between the logit of P and P itself is sigmoidal in shape. The regression equation that results is:
ln[P/(1-P)] = B0 + B1*X1 + B2*X2 + …
Although the left side of this equation looks intimidating, this way of expressing the probability results in the right side of the equation being linear and looking familiar to us. This helps us understand the meaning of the regression coefficients. The coefficients can easily be transformed so that their interpretation makes sense.
The logistic regression equation can be extended beyond the case of a dichotomous response variable to the cases of ordered categories and polytymous categories (more than two categories).