When the dependent variable in a regression model is a proportion or a percentage, it can be tricky to decide on the appropriate way to model it.

The big problem with ordinary linear regression is that the model can predict values that aren’t possible–values below 0 or above 1. But the other problem is that the relationship isn’t linear–it’s sigmoidal. A sigmoidal curve looks like a flattened S–linear in the middle, but flattened on the ends. So now what?

The simplest approach is to do a linear regression anyway. This approach can be justified only in a few situations.

1. All your data fall in the middle, linear section of the curve. This generally translates to all your data being between .2 and .8 (although I’ve heard that between .3 and .7 is better). If this holds, you don’t have to worry about the two objections. You do have a linear relationship, and you won’t get predicted values much beyond the range of your data–certainly not beyond 0 or 1.

2. It is a really complicated model that would be much harder to model another way. If you can assume a linear model, it will be much easier to do, say, a complicated mixed model or a structural equation model. If it’s just a single multiple regression, however, you should look into one of the other methods.
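To get a feel for why the middle of the curve can be treated as linear, the sketch below compares a sigmoid with the straight line tangent to it at a proportion of .5. It assumes the sigmoid is the standard logistic function, which is one common choice and not the only one. The function names are illustrative, and a fitted regression line would not be exactly this tangent.

```python
import math

def logistic(z):
    """Standard logistic curve: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the logistic curve (the log-odds of p)."""
    return math.log(p / (1.0 - p))

def tangent_line(z):
    """Straight line tangent to the logistic curve at z = 0 (p = .5); its slope is 1/4."""
    return 0.5 + z / 4.0

def linear_gap(p):
    """Absolute difference between the tangent line and the curve at proportion p."""
    z = logit(p)
    return abs(tangent_line(z) - p)

for p in (0.5, 0.7, 0.8, 0.9, 0.95, 0.99):
    print(f"p = {p:.2f}   gap between line and curve = {linear_gap(p):.3f}")
```

Under this assumption the gap stays below .05 for proportions between .2 and .8 and grows quickly outside that range. By .9 the straight line is already above 1, which is exactly the impossible-prediction problem described above.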

A second approach is to treat the proportion as a binary response, then run a logistic or probit regression. This will only work if the proportion can be thought of as a number of successes out of a number of trials, and you have the data for both the number of successes and the total number of trials. For example, the proportion of land area covered with a certain species of plant would be hard to think of this way, but the proportion of correct answers on a 20-answer assessment would.
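As a rough illustration of the successes-out-of-trials idea, here is a minimal sketch that fits a one-predictor logistic regression to counts by Newton-Raphson. The data (hours studied and correct answers out of 20) are made up, and the function names are hypothetical. In practice you would use your statistical software's logistic regression routine instead of hand-rolled code like this.

```python
import math

def predicted_proportion(b0, b1, x):
    """Inverse logit of the linear predictor; always strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def fit_binomial_logit(x, successes, trials, iterations=25):
    """Fit logit(p) = b0 + b1*x by Newton-Raphson on the binomial log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Score (gradient) and information matrix entries at the current estimates.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, si, ni in zip(x, successes, trials):
            p = predicted_proportion(b0, b1, xi)
            resid = si - ni * p        # observed minus expected successes
            w = ni * p * (1.0 - p)     # binomial variance weight
            g0 += resid
            g1 += resid * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Newton step: solve the 2x2 system (information matrix) * step = score.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Made-up data: six students, each answering a 20-question assessment.
hours = [1, 2, 3, 4, 5, 6]
correct = [4, 7, 9, 13, 15, 18]
total = [20] * 6

b0, b1 = fit_binomial_logit(hours, correct, total)
for h in hours:
    print(f"hours = {h}: predicted proportion correct = {predicted_proportion(b0, b1, h):.3f}")
```

Because the model works on the log-odds scale, the predicted proportions can never fall below 0 or above 1, and each student's 20 trials are weighted appropriately.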

The third approach is to treat the proportion as a censored continuous variable. The censoring means that you don’t have information below 0 or above 1. For example, perhaps the plant would spread even more if it hadn’t run out of land. If you take this approach, you would run the model as a two-limit tobit model (Long, 1997). This approach works best if there isn’t an excessive amount of censoring (values of 0 and 1).
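To make the censoring idea concrete, the sketch below writes out the log-likelihood of a two-limit tobit model under its usual setup: a normally distributed latent variable that is observed only when it falls between the limits. It does not fit the model. The latent means in `mu` would normally come from a linear predictor, and all names and numbers here are illustrative assumptions.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_logpdf(z):
    """Log of the standard normal density."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def two_limit_tobit_loglik(y, mu, sigma, lower=0.0, upper=1.0):
    """Log-likelihood of observed y given latent means mu and latent SD sigma."""
    total = 0.0
    for yi, mi in zip(y, mu):
        if yi <= lower:
            # Censored from below: the latent value is at or under the lower limit.
            total += math.log(normal_cdf((lower - mi) / sigma))
        elif yi >= upper:
            # Censored from above: the latent value is at or over the upper limit.
            total += math.log(normal_cdf((mi - upper) / sigma))
        else:
            # Uncensored: ordinary normal density of the observed value.
            total += normal_logpdf((yi - mi) / sigma) - math.log(sigma)
    return total

# Made-up proportions (two of them censored) and hypothetical latent means.
y = [0.0, 0.35, 0.8, 1.0]
mu = [0.1, 0.4, 0.7, 0.95]
print(two_limit_tobit_loglik(y, mu, sigma=0.25))
```

Observations at exactly 0 or 1 contribute only the probability that the latent value lies beyond the limit, which is why heavy censoring leaves the model with little information to work with.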

Reference: Long, J.S. (1997). Regression Models for Categorical and Limited Dependent Variables. Sage Publishing.

12 comments

I’m running a fractional logit to analyze eco-efficiency. The pseudo R2 is about 0.01. I searched other studies and found that they also report low pseudo R2 values. Can I interpret the model with this distribution of data?

I’m using Stata to analyze the fractional model, but I couldn’t get AIC and BIC. I would be very grateful if you could help me. I look forward to hearing from you.

My dependent variable is Probability of species being extant, so bounded between 0 and 1. It has a J-shaped distribution – the main peak around 0.9 and a secondary peak around 0.1. I have 2 explanatory variables and 2 random factors. I was wondering about trying a beta regression? But my stats is pretty limited so I would rather not! Any advice would be much appreciated.

Thanks.

Hello Elizabeth,

I have exactly the same issue. Did you find an answer anywhere? Or did you run the beta regression?

Thank you,

Eleni

Hi, I did an analysis of damage to starfish from fishing activities. My explanatory variables are two categorical variables: Vessel Type (“VESSEL”, three categories) and On-Deck Sorting Method (“SORT”, two categories). For each VESSEL and SORT combination I sampled several fishing trips. In each fishing trip I filled a basket with catch, gathered the starfish from the basket, and classified them according to their damage level (on a scale from DL1 to DL4). The number of starfish varies between baskets. I have attempted to understand the effect of VESSEL and SORT on the damage levels of starfish. I followed Crawley’s The R Book. I converted the damage level data to proportions (e.g. the proportion of Damage Level 1 individuals out of the total number of starfish sampled in that fishing trip) and fitted a generalised linear model with a binomial distribution to the data. Predictions from the fitted model give me completely nonsensical results. Should I try to run a logistic regression treating the data as binary (e.g. DL1 versus DL2, 3 and 4 combined)?

I suggest a logit transformation for the dependent variable, i.e.

y* = ln(y/(1-y))

y* is then used as the dependent variable in a linear regression model. However, you have to take care in interpreting the result.

One drawback to this approach arises when 0 and 1 are possible values of y. The transformation is undefined there, so it will return missing for those values of y that are exactly 0 or 1. An alternative approach was suggested by Allen and Nicholas (http://www.stata.com/support/faqs/statistics/logit-transformation/) or Baum (http://www.stata-journal.com/sjpdf.html?articlenum=st0147).
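The transformation and its drawback can be sketched in a few lines. Returning `None` for values of exactly 0 or 1 is an assumption made here to mimic a missing value, and the proportions are made up.

```python
import math

def logit_transform(y):
    """Return ln(y / (1 - y)), or None (treated as missing) when y is exactly 0 or 1."""
    if y <= 0.0 or y >= 1.0:
        return None
    return math.log(y / (1.0 - y))

def inverse_logit(y_star):
    """Map a value on the log-odds scale back to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y_star))

proportions = [0.0, 0.1, 0.5, 0.9, 1.0]  # made-up values
transformed = [logit_transform(y) for y in proportions]
for y, y_star in zip(proportions, transformed):
    print(y, "->", y_star)
```

Coefficients from a regression on y* are on the log-odds scale, so predictions need to go through `inverse_logit` before they can be read as proportions again.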

Hi,

here is another suggestion for when your dependent variable is a proportion. This is often the case when you have computed an index, for example. A tobit regression does not make much sense in these situations, I guess, since indices cannot be below 0 or above 1. Therefore, the assumption about a latent variable in tobit models is misleading.

Here is a suggestion for how to deal with this in another way in Stata:

http://www.ats.ucla.edu/stat/stata/faq/proportion.htm

True, there are other approaches.

An arcsin(sqrt) transformation works sometimes.

So does a generalized linear model, with a beta distribution. None works all the time–it depends on the details of the analysis.
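For reference, the arcsin(sqrt) transformation mentioned above is a one-liner. This is only a sketch of the transformation itself, with a hypothetical function name, not a recommendation for any particular data set.

```python
import math

def arcsine_sqrt(p):
    """Arcsine square root transformation of a proportion p in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    return math.asin(math.sqrt(p))

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"p = {p:.2f} -> {arcsine_sqrt(p):.4f}")
```

Unlike the logit transformation, it is defined at exactly 0 and 1, mapping proportions onto the range from 0 to pi/2.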

I’m currently working on a data set that includes the Free/Reduced price lunch status of public school students. This data is used as a proportion that ranges from .1 to .99 (i.e. 99% of students in a given district qualify for this).

We’d like to see the relationship between this variable and our other variable of interest, which indicates poverty level (however, this may also be in proportion format).

1. How should I handle the model, since it is clearly not between .3 and .7?

2. Can I have 2 proportions for both independent and dependent variables in my regression model?

Thanks in advance!

Hi Ally,

First, the proportion IV isn’t a problem. It’s the DV that is.

There are a few different ways to approach it, including a generalized linear model with a Beta distribution or a tobit regression. You could also try the linear model—just check assumptions! The biggest problem with bounded values is that often there are MANY observations pushed up against the bound, so residuals aren’t even close to normal.

Does anyone happen to have a reference for the limits of the interval within which the sigmoidal curve can be assumed to be linear?

I am currently running a tobit regression on the extent of the area under rice cultivation in Northern Ghana. The dependent variable is measured as the proportion of area under rice cultivation (area under rice cultivation by household i / total area of land under all crop production by the same household i). After running the tobit regression, I got a negative pseudo R-squared value, which I believe is not a good result. Can somebody advise me on what to do?

Ooh, you’re right, that doesn’t sound good. Pseudo R-squareds aren’t exactly the same as R squared, though, so it may, at best, just mean a poorly fitting model. I’m not sure I have a good suggestion on what to do. I’d have to really look into what you’ve got.

Karen
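On the negative pseudo R-squared raised in the comment above, one possible explanation can be sketched numerically. This assumes the software reports McFadden's pseudo R-squared, which is 1 minus the ratio of the model log-likelihood to the intercept-only log-likelihood. The log-likelihood values below are made up for illustration.

```python
def mcfadden_pseudo_r2(loglik_model, loglik_null):
    """McFadden's pseudo R-squared: 1 - (model log-likelihood / null log-likelihood)."""
    return 1.0 - loglik_model / loglik_null

# Discrete outcome (e.g. logistic regression): log-likelihoods are negative,
# so a better-fitting model gives a value between 0 and 1.
print(mcfadden_pseudo_r2(loglik_model=-100.0, loglik_null=-120.0))

# Continuous outcome (e.g. tobit on proportions): densities can exceed 1, so
# log-likelihoods can be positive. Here the model fits better than the null
# (8 > 5), yet the statistic comes out negative.
print(mcfadden_pseudo_r2(loglik_model=8.0, loglik_null=5.0))
```

If that is what happened, the negative value says more about the statistic than about the model, and comparing the two log-likelihoods directly would be more informative.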