Karen Grace-Martin

Proportions as Dependent Variable in Regression–Which Type of Model?

January 26th, 2009 by Karen Grace-Martin

When the dependent variable in a regression model is a proportion or a percentage, it can be tricky to decide on the appropriate way to model it.

The big problem with ordinary linear regression is that the model can predict values that aren’t possible–values below 0 or above 1. But the other problem is that the relationship isn’t linear–it’s sigmoidal. A sigmoidal curve looks like a flattened S–linear in the middle, but flattened on the ends. So now what?

The simplest approach is to do a linear regression anyway. This approach can be justified only in a few situations.

1. All your data fall in the middle, linear section of the curve. This generally translates to all your data being between .2 and .8 (although I’ve heard that between .3-.7 is better). If this holds, you don’t have to worry about the two objections. You do have a linear relationship, and you won’t get predicted values much beyond those values–certainly not beyond 0 or 1.

2. It is a really complicated model that would be much harder to model another way. If you can assume a linear model, it will be much easier to do, say, a complicated mixed model or a structural equation model. If it’s just a single multiple regression, however, you should look into one of the other methods.

A second approach is to treat the proportion as a binary response then run a logistic or probit regression. This will only work if the proportion can be thought of and you have the data for the number of successes and the total number of trials. For example, the proportion of land area covered with a certain species of plant would be hard to think of this way, but the proportion of correct answers on a 20-answer assessment would.

The third approach is to treat it the proportion as a censored continuous variable. The censoring means that you don’t have information below 0 or above 1. For example, perhaps the plant would spread even more if it hadn’t run out of land. If you take this approach, you would run the model as a two-limit tobit model (Long, 1997). This approach works best if there isn’t an excessive amount of censoring (values of 0 and 1).

Reference: Long, J.S. (1997). Regression Models for Categorical and Limited Dependent Variables. Sage Publishing.

16 comments

The Exposure Variable in Poisson Regression Models

January 23rd, 2009 by Karen Grace-Martin

Poisson Regression Models and its extensions (Zero-Inflated Poisson, Negative Binomial Regression, etc.) are used to model counts and rates. A few examples of count variables include:

– Number of words an eighteen month old can say

– Number of aggressive incidents performed by patients in an impatient rehab center

Most count variables follow one of these distributions in the Poisson family. Poisson regression models allow researchers to examine the relationship between predictors and count outcome variables.

Using these regression models gives much more accurate parameter (more…)

43 comments

Interpreting Interactions in Regression

January 19th, 2009 by Karen Grace-Martin

Adding interaction terms to a regression model has real benefits. It greatly expands your understanding of the relationships among the variables in the model. And you can test more specific hypotheses. But interpreting interactions in regression takes understanding of what each coefficient is telling you.

The example from Interpreting Regression Coefficients was a model of the height of a shrub (Height) based on the amount of bacteria in the soil (Bacteria) and whether the shrub is located in partial or full sun (Sun). Height is measured in cm, Bacteria is measured in thousand per ml of soil, and Sun = 0 if the plant is in partial sun, and Sun = 1 if the plant is in full sun.

(more…)

31 comments

Logistic Regression Models for Multinomial and Ordinal Variables

January 14th, 2009 by Karen Grace-Martin

Multinomial Logistic Regression

The multinomial (a.k.a. polytomous) logistic regression model is a simple extension of the binomial logistic regression model. They are used when the dependent variable has more than two nominal (unordered) categories.

Dummy coding of independent variables is quite common. In multinomial logistic regression the dependent variable is dummy coded into multiple 1/0 variables. There is a variable for all categories but one, so if there are M categories, there will be M-1 dummy variables. All but one category has its own dummy variable. Each category’s dummy variable has a value of 1 for its category and a 0 for all others. One category, the reference category, doesn’t need its own dummy variable as it is uniquely identified by all the other variables being 0.

The multinomial logistic regression then estimates a separate binary logistic regression model for each of those dummy variables. The result is (more…)

59 comments

The Great Likert Data Debate

January 9th, 2009 by Karen Grace-Martin

I first encountered the Great Likert Data Debate in 1992 in my first statistics class in my psychology graduate program.

My stats professor was a brilliant mathematical psychologist and taught the class unlike any psychology grad class I’ve ever seen since. Rather than learn ANOVA in SPSS, we derived the Method of Moments using Matlab. While I didn’t understand half of what was going on, this class roused my curiosity and led me to take more theoretical statistics classes. The rest is history.

A large section of the class was dedicated to the fact that Likert data was not interval and therefore not appropriate for statistics that assume normality such as ANOVA and regression. This was news to me. Meanwhile, most of the rest of the field either ignored or debated this assertion.

16 years later, the debate continues. A nice discussion of the debate is found on the Research Methodology blog by Hisham bin Md-Basir. It’s a nice blog with thoughtful entries that summarize methodological articles in the social and design sciences.

To be fair, though, this blog entry summarizes an article on the “Likert scales are not interval” side of the debate. For a balanced listing of references, see Can Likert Scale Data Ever Be Continuous?

1 comment

Variable Labels and Value Labels in SPSS

January 2nd, 2009 by Karen Grace-Martin

SPSS Variable Labels and Value Labels are two of the great features of its ability to create a code book right in the data set. Using these every time is good data analysis practice.

SPSS doesn’t limit variable names to 8 characters like it used to, but you still can’t use spaces, and it will make coding easier if you keep the variable names short. You then use Variable Labels to give a nice, long description of each variable. On questionnaires, I often use the actual question.

There are good reasons for using Variable Labels right in the data set. I know you want to get right to your data analysis, but using Variable Labels will save so much time later.

1. If your paper code sheet ever gets lost, you still have the variable names.

2. Anyone else who uses your data–lab assistants, graduate students, statisticians–will immediately know what each variable means.

3. As entrenched as you are with your data right now, you will forget what those variable names refer to within months. When a committee member or reviewer wants you to redo an analysis, it will save tons of time to have those variable labels right there.

4. It’s just more efficient–you don’t have to look up what those variable names mean when you read your output.

Variable Labels

The really nice part is SPSS makes Variable Labels easy to use:

1. Mouse over the variable name in the Data View spreadsheet to see the Variable Label.

2. In dialog boxes, lists of variables can be shown with either Variable Names or Variable Labels. Just go to Edit–>Options. In the General tab, choose Display Labels.

3. On the output, SPSS allows you to print out Variable Names or Variable Labels or both. I usually like to have both. Just go to Edit–>Options. In the Output tab, choose ‘Names and Labels’ in the first and third boxes.

Value Labels

Value Labels are similar, but Value Labels are descriptions of the values a variable can take. Labeling values right in SPSS means you don’t have to remember if 1=Strongly Agree and 5=Strongly Disagree or vice-versa. And it makes data entry much more efficient–you can type in 1 and 0 for Male and Female much faster than you can type out those whole words, or even M and F. But by having Value Labels, your data and output still give you the meaningful values.

Once again, SPSS makes it easy for you.

1. If you’d rather see Male and Female in the data set than 0 and 1, go to View–>Value Labels.

2. Like Variable Labels, you can get Value Labels on output, along with the actual values. Just go to Edit–>Options. In the ‘Output Labels’ tab, choose ‘Values and Labels’ in the second and fourth boxes.

100 comments