When Dependent Variables Are Not Fit for Linear Models, Now What?

by Karen


Unless your dependent variable is continuous, unbounded, and measured on an interval or ratio scale, your model will never meet the assumptions of the General Linear Model (GLM).  Today I’m going to go into more detail about six common types of dependent variables that fail these criteria, and the models that work instead.

Categorical Variables, including both binary (with 2 values) and multicategory (with 3 or more values), clearly fail all three criteria.  But a number of other types of regression models do fit these variables.

For binary variables, probit and logistic regression models are the most common.  For multicategorical variables, use multinomial logistic regression.
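
As a rough illustration of why logistic regression suits a binary outcome, the sketch below maps a linear predictor (on the log-odds scale) to a probability that always stays between 0 and 1, and converts a coefficient to an odds ratio. The coefficient values are hypothetical, chosen only for illustration, and are not estimated from any data.

```python
import math

def inverse_logit(eta):
    """Map a linear predictor (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients on the log-odds scale (illustration only).
intercept = -2.0
slope = 1.5

# Predicted probabilities for a few values of a single predictor x.
probabilities = [inverse_logit(intercept + slope * x) for x in range(5)]
for x, p in enumerate(probabilities):
    print(f"x={x}  P(Y=1)={p:.3f}")

# Exponentiating a coefficient gives the multiplicative change in the odds
# of Y=1 for a one-unit increase in x.
odds_ratio = math.exp(slope)
print(f"odds ratio for x: {odds_ratio:.2f}")
```

Coefficients from a logistic regression are on the log-odds scale, so values larger than 1 are not unusual. They are usually easier to read after exponentiating them into odds ratios.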

Ordinal Variables are ordered categories.  They include rank and Likert-item variables, although they are not limited to these.  Although ordinal variables look like numbers, the distance between their values isn’t equal in a true numerical sense, so it doesn’t make sense to apply numerical operations, like addition and division, to them.  Like unordered categorical variables, ordinal variables require specialized logistic or probit models, such as the proportional odds model.
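
A minimal sketch of how a proportional odds model turns one linear predictor into a probability for each ordered category: each cumulative probability P(Y <= j) is a logistic function of a cut point minus the linear predictor, and category probabilities are differences of adjacent cumulative probabilities. The cut points and predictor values below are hypothetical, and this is one common parameterization rather than the only one.

```python
import math

def logistic(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def ordinal_probs(eta, thresholds):
    """Category probabilities under a proportional odds model.

    P(Y <= j) = logistic(threshold_j - eta); each category probability is
    the difference between adjacent cumulative probabilities.
    """
    cumulative = [logistic(t - eta) for t in thresholds] + [1.0]
    probs = []
    previous = 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs

# Hypothetical cut points for a 5-category item (4 increasing thresholds).
thresholds = [-1.0, 0.0, 1.5, 3.0]

probs_low = ordinal_probs(0.0, thresholds)   # lower linear predictor
probs_high = ordinal_probs(2.0, thresholds)  # higher linear predictor

print([round(p, 3) for p in probs_low])
print([round(p, 3) for p in probs_high])
```

Raising the linear predictor shifts probability toward the higher categories without assuming the categories are equally spaced.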

Discrete counts fail the assumptions of a GLM for many reasons.  The most obvious is that the normal distribution of a GLM allows any value on the number scale, but counts are bounded at 0.  It just doesn’t make sense to predict negative numbers of cigarettes smoked each day, children in a family, or aggressive incidents.

But Poisson regression, or one of its brethren, is designed to accurately model count data.
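
To make the bounded-at-0 point concrete, the sketch below compares predictions from a straight line with predictions from a Poisson model that uses the usual log link. The coefficients are hypothetical and the same pair is reused in both models purely for contrast; in practice each model would be estimated separately.

```python
import math

def poisson_pmf(k, mean):
    """Probability of observing count k when the expected count is `mean`."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def poisson_mean(intercept, slope, x):
    """Expected count under a log link, which is always positive."""
    return math.exp(intercept + slope * x)

# Hypothetical coefficients (illustration only).
intercept, slope = 0.5, -0.8

# A straight line can predict negative counts.
linear_predictions = [intercept + slope * x for x in range(5)]
# The log link keeps every predicted count above zero.
poisson_predictions = [poisson_mean(intercept, slope, x) for x in range(5)]

print([round(v, 2) for v in linear_predictions])
print([round(v, 2) for v in poisson_predictions])

# The Poisson distribution puts probability only on 0, 1, 2, ...
print(round(sum(poisson_pmf(k, 2.0) for k in range(60)), 6))
```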

Zero Inflated data have a huge spike in the distribution at 0. They are common in Poisson models, but can occur with any distribution.  A recent example I saw was a set of scores on a depression scale.  The scale ran from 0 to 20, and 0 was by far the most common value (which is a good thing for the state of humanity, but it really messes up the GLM).  Even if the rest of the distribution is normal, you can’t transform zero inflated data to look normal.  A Zero-Inflated model, however, incorporates the high number of zeros by simultaneously modeling the excess zeros (0/Not 0) as a logistic regression and the remaining values as another distribution.  It’s pretty cool, actually.
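
As a sketch of the mixture idea, here is the probability function of a zero-inflated Poisson, one common zero-inflated model. It assumes a fixed probability of an excess zero (`pi_zero`) and a fixed Poisson mean; in a full model each would depend on predictors, the first through a logistic regression. Note that in this form the Poisson part can also produce zeros of its own. The parameter values are hypothetical.

```python
import math

def poisson_pmf(k, mean):
    """Ordinary Poisson probability of count k."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def zip_pmf(k, pi_zero, mean):
    """Zero-inflated Poisson probability of count k.

    With probability pi_zero the observation is an excess zero; otherwise
    it is drawn from a Poisson distribution with the given mean.
    """
    if k == 0:
        return pi_zero + (1.0 - pi_zero) * poisson_pmf(0, mean)
    return (1.0 - pi_zero) * poisson_pmf(k, mean)

# Hypothetical parameters: 40% excess zeros, Poisson mean of 2.
pi_zero, mean = 0.4, 2.0

for k in range(5):
    print(k, round(poisson_pmf(k, mean), 3), round(zip_pmf(k, pi_zero, mean), 3))
```

The spike at 0 comes from adding the excess-zero probability to the zeros the count distribution already produces.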

Censored or truncated data have full information about the values of the DV only for some values.  The distribution gets cut off for some values, often at the end of the distribution.  Examples include surveys that have exact income information for everyone up to $200k, but beyond that, everyone is just given “over $200k.”  In surveys, this is done for privacy reasons, since there just aren’t many people with such high incomes.  But sometimes it’s just a measurement issue.  Tobit regression models are designed to handle the imprecise measurements on some parts of the scale.
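
The core idea behind a tobit-style likelihood can be sketched as follows: exactly observed values contribute the normal density, while censored values contribute only the probability of falling beyond the cutoff. This simplified version assumes a single constant mean rather than a regression on predictors, right-censoring at one known limit, and made-up incomes in thousands of dollars.

```python
import math
from statistics import NormalDist

def censored_normal_loglik(observations, mu, sigma, upper_limit):
    """Log-likelihood of a normal model with right-censoring at upper_limit.

    observations: list of (value, censored) pairs. For a censored pair only
    the fact that the true value exceeds upper_limit is used.
    """
    dist = NormalDist(mu, sigma)
    total = 0.0
    for value, censored in observations:
        if censored:
            # Probability that the true value lies above the cutoff.
            total += math.log(1.0 - dist.cdf(upper_limit))
        else:
            # Density at the exactly observed value.
            total += math.log(dist.pdf(value))
    return total

# Hypothetical incomes in $1000s; the last two are recorded as "over 200".
incomes = [(45, False), (80, False), (120, False), (150, False),
           (200, True), (200, True)]

loglik_plausible = censored_normal_loglik(incomes, mu=120, sigma=60, upper_limit=200)
loglik_implausible = censored_normal_loglik(incomes, mu=20, sigma=60, upper_limit=200)
print(round(loglik_plausible, 2), round(loglik_implausible, 2))
```

Estimation would search for the parameter values that maximize this quantity; the two evaluations above only show that a mean consistent with the data scores higher.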

Proportion data, bounded at 0 and 1, or percentage data, bounded at 0 and 100, really become problematic if much of the data are close to the bounds.  If all the data fall in the middle portion, say in the .2 to .8 range, a GLM can give reasonably good results.  But beyond that, you need to use either a probit or logistic regression if the proportion measures discrete events (proportion of questions answered correctly) or a tobit regression if the proportion measures a continuous entity (proportion of time spent studying).
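
For the discrete-events case, a proportion can be treated as successes out of trials with a binomial likelihood and a logit link. The sketch below evaluates that log-likelihood for made-up data (questions answered correctly out of 20, by hours studied) at two hypothetical sets of coefficients. It only evaluates the likelihood and does not fit the model.

```python
import math

def binomial_loglik(data, intercept, slope):
    """Binomial log-likelihood with a logit link.

    data: list of (x, successes, trials) triples.
    """
    total = 0.0
    for x, successes, trials in data:
        p = 1.0 / (1.0 + math.exp(-(intercept + slope * x)))
        total += (math.log(math.comb(trials, successes))
                  + successes * math.log(p)
                  + (trials - successes) * math.log(1.0 - p))
    return total

# Hypothetical data: (hours studied, questions correct, questions asked).
scores = [(0, 2, 20), (1, 5, 20), (2, 11, 20), (3, 16, 20), (4, 19, 20)]

# Coefficients that track the rising proportions versus a flat 50% model.
loglik_rising = binomial_loglik(scores, intercept=-2.4, slope=1.2)
loglik_flat = binomial_loglik(scores, intercept=0.0, slope=0.0)
print(round(loglik_rising, 2), round(loglik_flat, 2))
```

Because the predicted proportion passes through the logistic function, it stays between 0 and 1 even for observations near the bounds.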


{ 26 comments… read them below or add one }

Ishanka August 24, 2017 at 8:58 am

Dear Karen,
I would like to direct my question to you, as I am struggling with my data analysis using binary logistic regression. In my study, the dependent variable is dichotomous, so I used binary logistic regression to analyze the data (in SPSS).
I got the results, but the beta coefficients do not make sense because the values are greater than 1. Now I am struggling with transforming the beta coefficients to meaningful values.
Could you please advise me on this problem?
Thank you.


hemis May 20, 2017 at 4:38 am

I have a response which is a measure of the severity of an accident, valued from 0 to 100 (integer). Which family in GLM should I use? Poisson? Binomial? or Gamma?


francis March 14, 2017 at 10:13 pm

Hi Karen,

I’m a student who is dealing with a survey for the first time.

I have a lot of variables, but I want to choose the frequency of someone’s buying a product as the dependent variable for a linear regression model.

It shows up like this: 1=always; 2=2-3 times a week; 3=once a week; 4=once a month; 5=almost never

and for another product like this:
never, almost never, once a week, 2 times a week

Is it correct to use one of these two as a dependent variable in a linear regression model? I thought that the classes must be continuous within and among them. The first one doesn’t seem to be continuous among classes, while the second does! Thank you for your answer.


Pradeep November 12, 2016 at 1:27 am

I have used the ECSI model to measure customer satisfaction and loyalty, and also collected customers’ sociodemographic and personal characteristics (categorical in nature). I want to run a logit regression to learn the influence of these categorical factors on loyalty. My problem is that the dependent variable is a scale (Likert), a summated average score from the ECSI model. I converted the dependent variable on the basis of this score: values below the average are coded 0 and values above the average are coded 1, which gives a binary DV.

I need your suggestion: is this a valid way to convert the scale dependent variable to a binary DV for this study?
Kindly help me so that I can take my research forward. If it is valid, I need a citation for it.


Alberto October 7, 2016 at 5:05 am

Hi, I have recently completed a log regression of 1 categorical variable vs 4 dependent variables. I have found the z scores and chi values for these regressions. Now I would like to know how I could rank the values within these variables to find “confidence intervals”, i.e. if the value of the dependent variable is above X, what is the confidence that this will cause the categorical variable to be “yes” or “no”, for example.


Deanne April 27, 2016 at 11:24 am

Hey Karen
thank you for the helpful post.
I actually have a zero-inflated data problem, and when running planned comparisons, the glm model just doesn’t like to compare the means of two different but “identical” (having exactly the same values) populations.
Any suggestions on how to deal with that? I assume the problem is due to all the zeros I have in the two populations, but even when I manipulate the data by adding a 1 to each (my dependent variable is binary), the problem persists… so I guess planned comparisons in glm really don’t like samples having identical values…
(and no, I can’t take out the comparison, since it is part of the model)


Aziz February 26, 2016 at 8:05 pm

Hi Karen,

Very useful post. I have data with more than 8000 observations. My dependent variable is binary, consisting of 0 and 1 only. Around 97% of its values are zero. When I plot a histogram of the residuals, I see a big spike around zero (towards the negative side of zero, i.e., the [-1,0] interval), and extremely small spikes to the left and right of the zero spike, at a residual value of around 2. I estimated GLM and GAM with logistic regression. Can I improve upon my estimation here, ideally to get a symmetric histogram of residuals? Thank you. Kind Regards, Aziz


Neetu January 25, 2016 at 5:40 am

Hi.. I have a dichotomous dep var and covariates are categorical(sex, birth loc and another ones are 5 point likert scale variables). I am using binary logistic regression. Is it right?


Matthew July 15, 2015 at 3:27 pm

Hi Karen,

Firstly thank you for this helpful article.

I have three proportion DVs and most of the data are very close to the bounds and the proportions across the three DVs all add to 100%.

Unfortunately I can’t find a way to include all three DVs in one analysis like you would do in multivariate. Do you think it would be ok to do three different sets of binomial regressions (events/trials) in GLM for each DV or am I risking bias/errors?



Karen July 18, 2015 at 9:52 am

Hi Matthew,

Hmm, if they add to 100%, do you really need all three? If you know each person’s answer to two, you know their answer to the third, right?

This is a tricky one…



Sherryll March 9, 2015 at 1:27 pm

i have panel data and the dependent variable is a calculated ratio. The explanatory variables consist of some macroeconomic indicators and other control variables. The aim of the analysis is to use variations in the Xs to explain the existence of cycles (given by fluctuations in Y). However, I am quite unsure about which model best fits the data. I’d be pleased to get a few suggestions.


Karen March 23, 2015 at 12:29 pm

Hi Sherryll,

As always, it depends on all the details. Is it a ratio that is bounded anywhere? Some are bounded, for example at 0 and 1, but not all are.


jessica February 15, 2015 at 2:47 pm

What exactly constitutes a “huge spike in the distribution at 0”? Is there a numeric cut off, perhaps if 50% of the participants scored a zero then one should implement a zero inflated model?


Ian Potter December 3, 2014 at 5:12 am

Hi Karen,
I apologise in advance if my question appears too rudimentary, but the statistics textbooks provide no clear answer, and my presence here is evidence of my research online.

1. I have a series of Likert scale questionnaires that I want to check for mean differences in scores according to the different conditions of my IV (x4 levels). I have carried out a reliability analysis; the Cronbach’s alphas are mostly acceptable. What is the acceptable way (my discipline is Social Science) to transform the different scores for each person for each questionnaire into a single score (is this what I have to do?)

2. For my second study, the questionnaires were responded to first before participants were divided into different treatments of my IV ( x4 levels) followed by responding to two more questionnaires (my DVs). I predicted that high scores on the initial questionnaires will predict high scores on the DVs if IV level = W & X, and the inverse if IV level = Y & Z.
Your response will be greatly appreciated. I am a self-funded doctorate student, a bit isolated, and statistics is not my adviser’s strongest point.

Ian Potter


Pisie July 19, 2014 at 9:36 pm

Hi, I like this post. I have a question: my data set has only one numerical variable (eggs per gram of feces) and the others are categorical (sex, age category, etc.). Can I fit a GLMM with this data set?


Loreth June 2, 2014 at 7:03 pm

I have as a response variable minutes spent feeding, which looks skewed to the left. Someone suggested using a Gamma distribution, but I am not too sure. Any suggestions? To put it into context, I want to know whether the diameter of a tree (DBH), tree species, and/or season has an effect on the duration an animal spends feeding on a specific tree. Thanks!


Endriyas May 1, 2014 at 9:07 am

Hi Karen,
I would like to know how a Likert scale can be changed for use in logistic regression?


Karen May 6, 2014 at 3:06 pm

Hi Endriyas,

The scale wouldn’t be changed (usually). You just need ordinal logistic regression. I would start with this webinar on logistic regression. It’s a free download.


Laura March 10, 2014 at 2:40 pm

Hi Karen,
Thanks for the interesting post. I have proportional data that is zero-inflated – the proportion of a carnivore’s diet consisting of small mammals, with lots of results recording no small mammals at all.
Would this be a zero-inflated tobit regression? Is there such a thing??


Karen March 10, 2014 at 4:57 pm

Hi Laura,

Sounds like it. I haven’t heard specifically of a zero-inflated tobit model, but it could exist. Zero inflated models are part of a class of models called mixture models, which combine two models.


Cory July 29, 2013 at 2:55 pm

I find this post very confusing – GLM typically stands for generalized linear models which were formulated as a way of unifying various other statistical models, including linear regression, logistic regression and Poisson regression.


Karen August 7, 2013 at 3:35 pm

Hi Cory, it’s one of those confusing terms. GLM also stands for General Linear Model, which is what I meant here.


Stan May 29, 2013 at 4:32 am

Karen, thanks for the information you give here. It’s very helpful.
I have a question. Is it correct to use a generalized linear mixed model when my data are percentages? Thank you.



Karen June 6, 2013 at 5:25 pm

It could be. Usually percentages (proportions, actually) either have to be considered binomial or possibly a beta distribution (although for a beta distribution, there can’t be any 0 or 1 proportions). Both would work in a GLMM.


Bereket.Y February 14, 2013 at 5:54 am

Can I use Poisson or another count data model if the outcomes of the dependent variable are only 0, 1, 2, 3, and 4?


Karen February 20, 2013 at 4:55 pm

Potentially. Is it truly a count? You can always run it, then check assumptions. If they’re met, then sure.


