When Dependent Variables Are Not Fit for Linear Models, Now What?

by Karen

When your dependent variable is not continuous, unbounded, and measured on an interval or ratio scale, your model cannot meet the assumptions of the General Linear Model (GLM, meaning the model behind linear regression and ANOVA, not the generalized linear model). Today I’m going to go into more detail about 6 common types of dependent variables that fall short of these criteria, and the models that work instead.

Categorical Variables, both binary (with 2 values) and multicategory (with 3 or more values), clearly fail all three criteria. But a number of other types of regression models do fit these variables.

For binary variables, probit and logistic regression models are the most common.  For multicategorical variables, use multinomial logistic regression.
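
One way to see why a logistic model suits a binary outcome is to compare its predictions with those of a straight line. The sketch below is only an illustration: the coefficients `b0` and `b1` are made-up values, not estimates from any data.

```python
import math

def linear_prediction(x, b0=0.1, b1=0.2):
    """Straight-line prediction: nothing keeps it between 0 and 1."""
    return b0 + b1 * x

def logistic_prediction(x, b0=-2.0, b1=0.8):
    """Logistic regression prediction: inverse logit of the linear predictor."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

xs = [-10, 0, 2.5, 10]  # hypothetical predictor values
linear_preds = [linear_prediction(x) for x in xs]
logistic_preds = [logistic_prediction(x) for x in xs]

for x, lin, logi in zip(xs, linear_preds, logistic_preds):
    print(f"x={x:>5}: linear={lin:6.2f}  logistic={logi:.4f}")
```

The straight line returns values below 0 and above 1 at the extremes, which cannot be probabilities, while the logistic predictions always stay strictly between 0 and 1.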

Ordinal Variables are ordered categories. They include rank and Likert-item variables, although they are not limited to these. Although ordinal variables look like numbers, the distance between their values isn’t equal in a true numerical sense, so it doesn’t make sense to apply numerical operations, like addition and division, to them. Like unordered categorical variables, ordinal variables require specialized logistic or probit models, such as the proportional odds model.
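
As a rough sketch of how a proportional odds model treats ordered categories: it models the cumulative probability of being at or below each category with a logistic curve, and the probability of each category is the difference between neighboring cumulative probabilities. The cut points and linear predictor values below are hypothetical, chosen only to show the mechanics.

```python
import math

def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))

def category_probabilities(eta, thresholds):
    """P(Y = j) for each ordered category under a proportional odds model.

    eta is the linear predictor (x times beta); thresholds are the
    increasing cut points, one fewer than the number of categories.
    """
    # P(Y <= j) for each cut point, plus 1.0 for the top category
    cumulative = [inv_logit(t - eta) for t in thresholds] + [1.0]
    probs = []
    previous = 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs

thresholds = [-1.0, 0.0, 1.5, 3.0]  # hypothetical cut points for a 5-point item
low = category_probabilities(eta=-1.0, thresholds=thresholds)
high = category_probabilities(eta=2.0, thresholds=thresholds)

print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

A larger linear predictor shifts probability toward the higher categories without ever assuming that the categories are equally spaced.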

Discrete counts fail the assumptions of a GLM for many reasons. The most obvious is that the normal distribution of a GLM allows any value on the number scale, but counts are bounded at 0. It just doesn’t make sense to predict negative numbers of cigarettes smoked each day, children in a family, or aggressive incidents.

But Poisson regression, or one of its brethren, is designed to accurately model count data.
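
A minimal sketch of the idea, assuming the usual log link for Poisson regression: the predicted mean is the exponential of the linear predictor, so it can never be negative. The coefficients here are invented for illustration.

```python
import math

def poisson_mean(x, b0=0.5, b1=-0.3):
    """Predicted mean count with a log link: exp of the linear predictor."""
    return math.exp(b0 + b1 * x)

def poisson_pmf(k, mu):
    """Probability of observing exactly k events when the mean is mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)

xs = [0, 5, 20]  # hypothetical predictor values
means = [poisson_mean(x) for x in xs]
straight_line = [0.5 - 0.3 * x for x in xs]  # same coefficients, no log link

for x, mu, lin in zip(xs, means, straight_line):
    print(f"x={x:>2}: Poisson mean={mu:.4f}  straight line={lin:.2f}")

# Probabilities over the counts 0, 1, 2, ... add up to 1 (up to rounding)
total = sum(poisson_pmf(k, means[0]) for k in range(50))
```

With these coefficients the straight line predicts a negative count at the larger predictor values, while the Poisson mean only shrinks toward 0.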

Zero Inflated data have a huge spike in the distribution at 0. They are common in Poisson models, but can occur with any distribution. A recent example I saw was a set of scores on a depression scale. The scale ran from 0 to 20, and 0 was by far the most common value (which is a good thing for the state of humanity, but it really messes up the GLM). Even if the rest of the distribution is normal, you can’t transform zero-inflated data to look normal. A Zero-Inflated model, however, incorporates the high number of zeros by simultaneously modeling 0/Not 0 as a logistic regression and all the Not 0 values as another distribution. It’s pretty cool, actually.
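
To make the two-part idea concrete, here is a small sketch of a zero-inflated Poisson distribution. It assumes a Poisson count part, which is only one of the possible choices, and the values of `pi_zero` and `mu` are hypothetical. Note that in this common formulation a zero can come from either part: the extra "always zero" part, or the Poisson part happening to produce a 0.

```python
import math

def zip_pmf(k, pi_zero, mu):
    """Probability of count k under a zero-inflated Poisson distribution.

    pi_zero is the probability of an extra (structural) zero;
    mu is the mean of the Poisson part.
    """
    poisson = math.exp(-mu) * mu ** k / math.factorial(k)
    if k == 0:
        return pi_zero + (1.0 - pi_zero) * poisson
    return (1.0 - pi_zero) * poisson

p0_plain = zip_pmf(0, pi_zero=0.0, mu=3.0)  # ordinary Poisson, no inflation
p0_zip = zip_pmf(0, pi_zero=0.4, mu=3.0)    # 40% extra zeros

print(f"P(0) plain Poisson:  {p0_plain:.4f}")
print(f"P(0) zero-inflated:  {p0_zip:.4f}")

total = sum(zip_pmf(k, 0.4, 3.0) for k in range(60))
```

In a regression setting, `pi_zero` would typically depend on predictors through a logistic model and `mu` through a count model, both estimated together.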

Censored or truncated data have full information about the DV only over part of its range. The distribution gets cut off for some values, often at the end of the distribution. Examples include surveys that have exact income information for everyone up to $200k, but beyond that, everyone is just given “over $200k.” In surveys, this is done for privacy reasons, because there just aren’t many people with such high incomes. But sometimes it’s just a measurement issue. Tobit regression models are designed to handle the imprecise measurements on some parts of the scale.
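
The sketch below shows, in simplified form, how a tobit-style likelihood uses censored values. It assumes a normal distribution, right-censoring at a known limit, and a single overall mean instead of a full regression; the incomes (in thousands of dollars) are made up. An exactly observed value contributes its normal density, while a censored value contributes only the probability of being at or above the limit.

```python
import math

def normal_logpdf(y, mu, sigma):
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tobit_loglik(values, censored, mu, sigma, limit):
    """Log-likelihood for normal data right-censored at `limit`."""
    total = 0.0
    for y, is_censored in zip(values, censored):
        if is_censored:
            # All we know is that the true value is at or above the limit
            total += math.log(1.0 - normal_cdf((limit - mu) / sigma))
        else:
            total += normal_logpdf(y, mu, sigma)
    return total

incomes = [45.0, 80.0, 120.0, 200.0, 200.0]    # hypothetical, in $1000s
censored = [False, False, False, True, True]   # last two are "over $200k"

ll_high = tobit_loglik(incomes, censored, mu=130.0, sigma=70.0, limit=200.0)
ll_low = tobit_loglik(incomes, censored, mu=20.0, sigma=70.0, limit=200.0)
print(f"log-likelihood at mu=130: {ll_high:.2f}")
print(f"log-likelihood at mu=20:  {ll_low:.2f}")
```

A fitting routine would search for the `mu` and `sigma` (or regression coefficients) that maximize this log-likelihood, rather than treating "over $200k" as if it were exactly 200.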

Proportion data, bounded at 0 and 1, or percentage data, bounded at 0 and 100, really become problematic if much of the data are close to the bounds.  If all the data fall in the middle portion, say in the .2 to .8 range, a GLM can give reasonably good results.  But beyond that, you need to either use a probit or logistic regression if the proportion measures discrete events (proportion of questions answered correctly) or a tobit regression if the proportion measures a continuous entity (proportion of time spent studying).
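
One possible way to see why the middle of the range is forgiving: the S-shaped logistic curve is close to a straight line around 0.5 and bends sharply only near 0 and 1. The sketch below measures the largest gap between the logistic curve and the straight line that touches it at 0.5. It is an illustration of the shape only, not a test of any particular data set.

```python
import math

def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

def tangent_line(z):
    """Straight line touching the logistic curve at p = 0.5."""
    return 0.5 + z / 4.0

def max_gap(p_low, p_high, steps=200):
    """Largest gap between curve and line for proportions in [p_low, p_high]."""
    z_low, z_high = logit(p_low), logit(p_high)
    gaps = []
    for i in range(steps + 1):
        z = z_low + (z_high - z_low) * i / steps
        gaps.append(abs(inv_logit(z) - tangent_line(z)))
    return max(gaps)

middle_gap = max_gap(0.2, 0.8)
wide_gap = max_gap(0.02, 0.98)
print(f"largest gap for proportions in .2 to .8:   {middle_gap:.3f}")
print(f"largest gap for proportions in .02 to .98: {wide_gap:.3f}")
```

The gap stays small over the .2 to .8 range and grows large once the proportions approach the bounds, which is where the straight line starts predicting values outside 0 to 1.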


{ 25 comments… read them below or add one }

hemis May 20, 2017 at 4:38 am

I have a response which is a measure of the severity of an accident, valued from 0 to 100 (integer). Which family in GLM should I use? Poisson? Binomial? or Gamma?

francis March 14, 2017 at 10:13 pm

Hi Karen,

I’m a student who is dealing with a survey for the first time.

I have a lot of variables, but I want to choose how often someone buys a product as the dependent variable for a linear regression model.

It is coded like this: 1=always; 2=2-3 times a week; 3=once a week; 4=once a month; 5=almost never

and for another product like this:
never, almost never, once a week, 2 times a week

Is it correct to use one of these two as a dependent variable in a linear regression model? I thought that the classes must be continuous within and among them. The first one doesn’t seem to be continuous among classes, while the second one does! Thank you for your answer.

Pradeep November 12, 2016 at 1:27 am

I have used the ECSI model to measure customer satisfaction and loyalty, and I also collected customers’ sociodemographic and personal characteristics (categorical in nature). I wanted to run a logit regression to learn the influence of these categorical factors on loyalty. My problem is that the dependent variable is scale in nature (Likert): it is a summated average score from the ECSI model. I converted it into a binary DV by coding scores below the average as 0 and scores above the average as 1.

In this regard I need your suggestion: is this a valid way to convert the scale dependent variable to a binary DV, especially for this study?
Kindly help me, sir, so that I can take my research forward. If it is valid, I need a citation for it.

Alberto October 7, 2016 at 5:05 am

Hi, I have recently completed a log regression of 1 categorical variable vs 4 dependent variables. I have found the z scores and chi values for these regressions. Now I would like to know how I could rank the values within these variables to find “confidence intervals”, i.e., if the value of the dependent variable is above X, how confident can I be that the categorical variable will be “yes” or “no”?
Thanks
Alberto

Deanne April 27, 2016 at 11:24 am

Hey Karen
thank you for the helpful post.
I actually have a zero-inflated data problem, and when I run planned comparisons on a GLM, the model just doesn’t like to compare the means between two different but “identical” (having exactly the same values) populations.
Any suggestion on how to deal with that? I assume the problem is due to all the zeros I have in the two populations, but even when I manipulate the data by adding a 1 to each (my dependent variable is binary), the problem persists… so I guess planned comparisons in GLM really don’t like samples having identical values…
(and no, I can’t take out the comparison since it is part of the model)

Aziz February 26, 2016 at 8:05 pm

Hi Karen,

Very useful post. I have data with more than 8000 observations. My dependent variable is a binary variable consisting of 0 and 1 only. Around 97% of the dependent variable values are zero. When I plot a histogram of the residuals, I see a big spike around zero (towards the negative side of zero, i.e., the [-1,0] interval), and extremely small spikes to the left of the zero spike and to the right of it, at a residual value of around 2. I estimated a GLM and a GAM with logistic regression. Can I improve upon my estimation here, ideally to get a symmetric histogram for the residuals? Thank you. Kind Regards, Aziz

Neetu January 25, 2016 at 5:40 am

Hi. I have a dichotomous dependent variable, and the covariates are categorical (sex, birth location, and some others that are 5-point Likert scale variables). I am using binary logistic regression. Is that right?

Matthew July 15, 2015 at 3:27 pm

Hi Karen,

Firstly thank you for this helpful article.

I have three proportion DVs and most of the data are very close to the bounds and the proportions across the three DVs all add to 100%.

Unfortunately I can’t find a way to include all three DVs in one analysis like you would do in multivariate. Do you think it would be ok to do three different sets of binomial regressions (events/trials) in GLM for each DV or am I risking bias/errors?

Thanks.

Karen July 18, 2015 at 9:52 am

Hi Matthew,

Hmm, if they add to 100%, do you really need all three? If you know each person’s answer to two, you know their answer to the third, right?

This is a tricky one…

Karen

Sherryll March 9, 2015 at 1:27 pm

Hi,
I have panel data and the dependent variable is a calculated ratio. The explanatory variables consist of some macroeconomic indicators and other control variables. The aim of the analysis is to use variations in the Xs to explain the existence of cycles (given by fluctuations in Y). However, I am quite unsure about which model best fits the data. I’d be pleased to get a few suggestions.
Thanks.

Karen March 23, 2015 at 12:29 pm

Hi Sherryll,

As always, it depends on all the details. Is it a ratio that is bounded anywhere? Some are bounded, for example at 0 and 1, but not all are.

jessica February 15, 2015 at 2:47 pm

What exactly constitutes a “huge spike in the distribution at 0”? Is there a numeric cut off, perhaps if 50% of the participants scored a zero then one should implement a zero inflated model?

Ian Potter December 3, 2014 at 5:12 am

Hi Karen,
I apologise in advance if my question appears too rudimentary, but the statistics textbooks provide no clear answer, and my presence here is evidence of my research online.

1. I have a series of Likert scale questionnaires, and I want to check for mean differences in scores across the conditions of my IV (4 levels). I have carried out a reliability analysis; the Cronbach’s alphas are mostly acceptable. What is the acceptable way (my discipline is Social Science) to combine each person’s scores on each questionnaire into a single score (is this what I have to do)?

2. For my second study, the questionnaires were responded to first before participants were divided into different treatments of my IV ( x4 levels) followed by responding to two more questionnaires (my DVs). I predicted that high scores on the initial questionnaires will predict high scores on the DVs if IV level = W & X, and the inverse if IV level = Y & Z.
Your response will be greatly appreciated. I am a self-funded doctorate student, a bit isolated, and statistics is not my adviser’s strongest point.

Thanks,
Ian Potter

Pisie July 19, 2014 at 9:36 pm

Hi, I like this post. I have a question: my data set has only one numerical variable (eggs per gram of feces) and the others are categorical (sex, age category, etc.). Can I fit a GLMM with this data set?

Loreth June 2, 2014 at 7:03 pm

Hello,
My response variable is minutes spent feeding, which looks skewed to the left. Someone suggested using a Gamma distribution, but I am not too sure. Any suggestions? To put it into context, I want to know whether the diameter of a tree (DBH), tree species, and/or season has an effect on how long an animal spends feeding on a specific tree. Thanks!

Endriyas May 1, 2014 at 9:07 am

Hi Karen,
I would like to know how a Likert scale can be changed to logistic regression.

Karen May 6, 2014 at 3:06 pm

Hi Endriyas,

The scale wouldn’t be changed (usually). You just need ordinal logistic regression. I would start with this webinar on logistic regression. It’s a free download.

Laura March 10, 2014 at 2:40 pm

Hi Karen,
Thanks for the interesting post. I have proportional data that is zero-inflated: the proportion of a carnivore’s diet consisting of small mammals, with lots of results recording no small mammals at all.
Would this be a zero-inflated tobit regression? Is there such a thing??
Thanks
Laura

Karen March 10, 2014 at 4:57 pm

Hi Laura,

Sounds like it. I haven’t heard specifically of a zero-inflated tobit model, but it could exist. Zero inflated models are part of a class of models called mixture models, which combine two models.

Cory July 29, 2013 at 2:55 pm

I find this post very confusing – GLM typically stands for generalized linear models which were formulated as a way of unifying various other statistical models, including linear regression, logistic regression and Poisson regression.

Karen August 7, 2013 at 3:35 pm

Hi Cory, it’s one of those confusing terms. GLM also stands for General Linear Model, which is what I meant here.

Stan May 29, 2013 at 4:32 am

Karen, thanks for the information you give here. It’s very helpful.
I have a question. Is it correct to use a generalized linear mixed model when my data are percentages? Thank you.

Stan

Karen June 6, 2013 at 5:25 pm

It could be. Usually percentages (proportions, actually) either have to be considered binomial or possibly a beta distribution (although for a beta distribution, there can’t be any 0 or 1 proportions). Both would work in a GLMM.

Bereket.Y February 14, 2013 at 5:54 am

Can I use Poisson or one of the other count data models when the outcomes of the dependent variable are only 0, 1, 2, 3, and 4?

Karen February 20, 2013 at 4:55 pm

Potentially. Is it truly a count? You can always run it, then check assumptions. If they’re met, then sure.

Karen
