The Analysis Factor

Statistical Consulting, Resources, and Statistics Workshops for Researchers

When Dependent Variables Are Not Fit for Linear Models, Now What?

by Karen Grace-Martin

When your dependent variable is not continuous, unbounded, and measured on an interval or ratio scale, your model will not meet the assumptions of linear models.

Today I’m going to go into more detail about 6 common types of dependent variables that are not continuous, unbounded, and measured on an interval or ratio scale and the tests that work instead.

Side note: the usual advice is to use nonparametric tests when normality assumptions fail. That works when you’re doing something simple, like a correlation or comparing group means. But if you’re including covariates or interactions in a model, you need a real model.

Categorical Variables

Both binary (2 values) and multicategory (3 or more values) variables clearly fail all three criteria.  But there are other types of regression models that work just fine for these variables.

For binary variables, probit and logistic regression models are the most common.  For multicategorical variables, use multinomial logistic regression.
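
As a rough illustration of why a logistic model suits a binary outcome, here is a minimal sketch in plain Python. The coefficients `b0` and `b1` are made-up values chosen for illustration, not estimates from any data set. The point is only that the linear predictor can be any real number, while the inverse logit always returns a valid probability.

```python
import math

def inverse_logit(eta):
    """Map a linear predictor (any real number) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients, chosen only for illustration.
b0, b1 = -2.0, 0.8

for x in (-5, 0, 5, 10):
    eta = b0 + b1 * x        # unbounded, like a linear model's prediction
    p = inverse_logit(eta)   # always strictly between 0 and 1
    print(f"x={x:>3}  linear predictor={eta:6.2f}  P(y=1)={p:.3f}")

# In a logistic model, exp(b1) is the odds ratio for a one-unit increase in x.
odds_ratio = math.exp(b1)
print(f"odds ratio per unit of x: {odds_ratio:.3f}")
```

A probit model follows the same idea but uses the standard normal cumulative distribution function in place of the inverse logit.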

Ordinal Variables

These variables are made up of ordered categories.  They include rank and Likert-item variables, although they are not limited to these.

Although ordinal variables look like numbers, the distances between their values aren’t equal in a true numerical sense, so it doesn’t make sense to apply numerical operations, like addition and division, to them. Hence means, the basis of linear models, don’t really compute.

Like unordered categorical variables, ordinal variables require specialized logistic or probit models, such as the proportional odds model. There are a few other types of ordinal models, but the proportional odds model is most commonly available.
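
To make the proportional odds idea concrete, here is a small sketch under one common parameterization, logit P(Y ≤ j) = θ_j − βx. The thresholds and slope below are hypothetical values, and software packages differ in the sign convention they use, so treat this as an illustration rather than any particular program's output.

```python
import math

def cumulative_probs(eta, thresholds):
    """P(Y <= j) at each threshold under a cumulative logit model."""
    return [1.0 / (1.0 + math.exp(-(t - eta))) for t in thresholds]

def category_probs(eta, thresholds):
    """Probability of each ordered category, as differences of cumulative probabilities."""
    cumulative = cumulative_probs(eta, thresholds) + [1.0]
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs

# Hypothetical thresholds for a 4-category item (must be increasing) and a hypothetical slope.
thresholds = [-1.0, 0.5, 2.0]
beta = 0.7

for x in (0, 2):
    probs = category_probs(beta * x, thresholds)
    print(f"x={x}", [round(p, 3) for p in probs])
```

The same slope shifts every cumulative probability, which is the "proportional odds" assumption: only the thresholds differ between categories.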

Count Variables

Discrete counts fail the assumptions of linear models for many reasons.  The most obvious is that the normal distribution of linear models allows any value on the number scale, but counts are bounded at 0.  It just doesn’t make sense to predict negative numbers of cigarettes smoked each day, children in a family, or aggressive incidents.

But Poisson regression, or related models like negative binomial, are designed to accurately model count data.
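
A quick sketch of the difference, using made-up coefficients: reading the linear predictor directly as a predicted count can give negative values, while the log link used in Poisson regression always gives a positive expected count. The names `b0` and `b1` are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing count k when the expected count is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Hypothetical coefficients, chosen only for illustration.
b0, b1 = 1.0, -0.5

for x in (0, 2, 4, 6):
    linear_prediction = b0 + b1 * x          # identity link: can go below 0
    expected_count = math.exp(b0 + b1 * x)   # log link: always positive
    print(f"x={x}  linear={linear_prediction:5.2f}  Poisson mean={expected_count:.3f}")

# The Poisson distribution puts probability only on 0, 1, 2, ...
print([round(poisson_pmf(k, 2.0), 3) for k in range(5)])
```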

Zero Inflated Variables

Zero-inflated data have a spike in the distribution at 0.

They are common in Poisson data, but can occur with any distribution.  A recent example I saw was a set of scores on a depression scale.  The scale ran from 0 to 20, and 0 was by far the most common value (which is a good thing for the state of humanity, but it really messes up the linear model assumptions).

Even if the rest of the distribution is normal, you can’t transform zero-inflated data to look normal.  A zero-inflated model, however, accounts for the high number of zeros by simultaneously modeling two parts: a logistic regression for whether an observation is an excess zero, and another distribution (such as Poisson) for the remaining values, which can itself produce some zeros.  It’s pretty cool, actually.
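
As an illustration, here is a sketch of one common version, the zero-inflated Poisson. The mixing probability `pi_zero` and the Poisson mean `lam` are hypothetical values; in a real model each would depend on predictors. In this formulation a zero can come from either part of the mixture.

```python
import math

def zip_pmf(k, pi_zero, lam):
    """Zero-inflated Poisson probability of count k.

    pi_zero: probability of an excess (structural) zero.
    lam:     mean of the Poisson part.
    """
    poisson = math.exp(-lam) * lam**k / math.factorial(k)
    if k == 0:
        # Zeros arise from the excess-zero part or from the Poisson part.
        return pi_zero + (1 - pi_zero) * poisson
    return (1 - pi_zero) * poisson

# Hypothetical parameter values, for illustration only.
pi_zero, lam = 0.4, 3.0

print("P(0), plain Poisson: ", round(math.exp(-lam), 4))
print("P(0), zero-inflated: ", round(zip_pmf(0, pi_zero, lam), 4))
```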

Censored Variables

Censored data have full information about the values of the DV only for some values.  The distribution gets cut off for some values, often at the end of the distribution.

Examples include surveys that have exact income information for everyone up to $200k, but beyond that, everyone is just recorded as “over $200k.”  In surveys, this is done for privacy reasons: there just aren’t many people with such high incomes.

But sometimes it’s just a measurement issue.  Tobit regression models are designed to handle the imprecise measurements on some parts of the scale.
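
The core idea can be sketched with a log-likelihood for right-censored data under a normal model. This is a simplified, intercept-only illustration with made-up incomes (in $1000s) and made-up parameter values, not a full Tobit fit. Exact observations contribute a density, while censored ones contribute only the probability of exceeding the limit.

```python
import math

def normal_logpdf(y, mu, sigma):
    """Log density of a normal distribution at y."""
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)

def normal_sf(y, mu, sigma):
    """P(Y > y) for a normal distribution."""
    return 0.5 * math.erfc((y - mu) / (sigma * math.sqrt(2)))

def censored_loglik(observations, mu, sigma, upper_limit):
    """Log-likelihood when values at or above upper_limit are right-censored."""
    total = 0.0
    for y in observations:
        if y >= upper_limit:
            # Only known: the true value exceeds the limit.
            total += math.log(normal_sf(upper_limit, mu, sigma))
        else:
            total += normal_logpdf(y, mu, sigma)
    return total

# Hypothetical incomes in $1000s; 200 stands for "over $200k".
incomes = [45, 80, 120, 200, 200]
print(censored_loglik(incomes, mu=100, sigma=60, upper_limit=200))
```

In a regression setting, `mu` would be replaced by a linear predictor for each observation and the parameters chosen to maximize this quantity.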

Proportions

Proportions, bounded at 0 and 1, or percentages, bounded at 0 and 100, really become problematic if much of the data are close to the bounds.

If all the data fall in the middle portion, say in the .2 to .8 range, a linear model can give reasonably good results.  But beyond that, you need to use either a beta regression, if the proportion is continuous, or a logistic regression, if the proportion measures discrete events with a certain outcome (e.g., the proportion of questions answered correctly).
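
One way to see why the bounds matter: under a beta distribution written in terms of a mean `mu` and a precision `phi` (a parameterization often used in beta regression, assumed here for illustration), the variance shrinks as the mean approaches 0 or 1. That conflicts with the constant-variance assumption of a linear model. The value of `phi` below is arbitrary.

```python
def beta_variance(mu, phi):
    """Variance of a beta distribution with mean mu and precision phi.

    Equivalent to shape parameters a = mu * phi and b = (1 - mu) * phi.
    """
    return mu * (1 - mu) / (1 + phi)

phi = 10.0  # hypothetical precision, for illustration only

for mu in (0.5, 0.8, 0.95, 0.99):
    print(f"mean={mu:.2f}  variance={beta_variance(mu, phi):.5f}")
```

The variance is largest at a mean of .5 and falls steadily toward the bounds, which is consistent with a linear model doing reasonably well only in the middle range.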

 

Comments

  1. Mark Mohan says

    January 23, 2019 at 1:42 am

    Dear Karen!
    Thanks for the interesting post.

    If the dependent variable is a zero-inflated continuous variable (many zeros, with both negative and positive values between 0 and 1), what are the appropriate regression types? (The independent variables include both continuous and categorical variables.)

    Thank you

    • Karen Grace-Martin says

      March 4, 2019 at 11:28 am

      Hi Mark,

      It’s hard to say without looking at it and getting all the details. It’s possible you could just do a linear regression. Run one and take a look at the distribution of the residuals. They may look normal.

      Or it may be that 0 is qualitatively different in some way than the other values, in which case you’d need a zero inflated model.

  2. Ishanka says

    August 24, 2017 at 8:58 am

    Dear Karen,
    I would like to direct my question to you, as I am struggling with my data analysis using binary logistic regression. In my study, the dependent variable is dichotomous, so I used binary logistic regression to analyze the data (in SPSS).
    I got the results, but the beta coefficients do not make sense because the values are greater than 1. Now I am struggling with transforming the beta coefficients into meaningful values.
    Could you please advise me on this problem.
    Thank you.
    Ishanka

  3. hemis says

    May 20, 2017 at 4:38 am

    I have a response which is a measure of the severity of an accident, valued from 0 to 100 (integer). Which family in GLM should I use? Poisson? Binomial? or Gamma?

  4. francis says

    March 14, 2017 at 10:13 pm

    HI karen,

    I’m a student who is dealing with a survey for the first time.

    I have a lot of variables, but I want to use how often someone buys a product as the dependent variable in a linear regression model.

    It is coded like this: 1=always; 2=2-3 times a week; 3=once a week; 4=once a month; 5=almost never

    and for another product like this:
    never, almost never, once a week, 2 times a week

    Is it correct to use one of these two as a dependent variable in a linear regression model? I thought the classes must be continuous within and between them. The first one doesn’t seem to be continuous between classes, while the second does. Thank you for your answer.

  5. Pradeep says

    November 12, 2016 at 1:27 am

    I have used the ECSI model to measure customer satisfaction and loyalty, and I also collected customers’ sociodemographic and personal characteristics (categorical in nature). I want to run a logit regression to learn the influence of these categorical factors on loyalty. My problem is that the dependent variable is a scale (Likert), a summated average score from the ECSI model. I converted the dependent variable to binary based on that score: values below the average are coded 0 and values above the average are coded 1.

    Is this a valid way to convert a scale dependent variable to a binary DV for this study?
    Kindly help me so that I can take my research forward. If it is valid, I need a citation for it.

  6. Alberto says

    October 7, 2016 at 5:05 am

    Hi, I have recently completed a log regression of 1 categorical variable vs 4 dependent variables. I have found the z scores and chi values for these regressions; however, now I would like to know how I could rank the values within these variables to find “confidence intervals,” i.e., if the value of the dependent variable is above X, what is the confidence that this will cause the categorical variable to be “yes” or “no,” for example.
    Thanks
    Alberto

  7. Deanne says

    April 27, 2016 at 11:24 am

    Hey Karen
    thank you for the helpful post.
    I actually have a zero-inflated data problem: when I run planned comparisons on a GLM, it doesn’t like comparing the means of two different but “identical” (having exactly the same values) populations.
    Any suggestions on how to deal with that? I assume the problem is due to all the zeros I have in the two populations, but even when I manipulate the data by adding a 1 to each (my dependent variable is binary), the problem persists… so I guess planned comparisons in GLM just really don’t like samples with identical values…
    (And no, I can’t take out the comparison, since it is part of the model.)

  8. Aziz says

    February 26, 2016 at 8:05 pm

    Hi Karen,

    Very useful post. I have data with more than 8000 observations. My dependent variable is a binary variable consisting of 0 and 1 only. Around 97% of the dependent variable values are zero. When I plot a histogram of the residuals, I see a big spike around zero (towards the negative side of zero, i.e., the [-1,0] interval), and extremely small spikes to the left of the zero spike and to the right of it at a residual value of around 2. I estimated a GLM and a GAM with logistic regression. Can I improve upon my estimation here, ideally to get a symmetric histogram of residuals? Thank you. Kind Regards, Aziz

  9. Neetu says

    January 25, 2016 at 5:40 am

    Hi. I have a dichotomous dependent variable, and the covariates are categorical (sex, birth location, and the others are 5-point Likert scale variables). I am using binary logistic regression. Is that right?

  10. Matthew says

    July 15, 2015 at 3:27 pm

    Hi Karen,

    Firstly thank you for this helpful article.

    I have three proportion DVs and most of the data are very close to the bounds and the proportions across the three DVs all add to 100%.

    Unfortunately I can’t find a way to include all three DVs in one analysis like you would do in multivariate. Do you think it would be ok to do three different sets of binomial regressions (events/trials) in GLM for each DV or am I risking bias/errors?

    Thanks.

    • Karen says

      July 18, 2015 at 9:52 am

      Hi Matthew,

      Hmm, if they add to 100%, do you really need all three? If you know each person’s answer to two, you know their answer to the third, right?

      This is a tricky one…

      Karen

  11. Sherryll says

    March 9, 2015 at 1:27 pm

    Hi,
    i have panel data and the dependent variable is a calculated ratio. The explanatory variables consist of some macroeconomic indicators and other control variables. The aim of the analysis is to use variations in the Xs to explain the existence of cycles (given by fluctuations in Y). However, I am quite unsure about which model best fits the data. I’d be pleased to get a few suggestions.
    Thanks.

    • Karen says

      March 23, 2015 at 12:29 pm

      Hi Sherryll,

      As always, it depends on all the details. Is it a ratio that is bounded anywhere? Some are bounded, for example at 0 and 1, but not all are.

  12. jessica says

    February 15, 2015 at 2:47 pm

    What exactly constitutes a “huge spike in the distribution at 0”? Is there a numeric cut off, perhaps if 50% of the participants scored a zero then one should implement a zero inflated model?

  13. Ian Potter says

    December 3, 2014 at 5:12 am

    Hi Karen,
    I apologise in advance if my question appears too rudimentary, but the statistics textbooks provide no clear answer, and my presence here is evidence of my research online.

    1. I have a series of Likert scale questionnaires that I want to check for mean differences in scores according to the different conditions of my IV (4 levels). I have carried out a reliability analysis; the Cronbach’s alphas are mostly acceptable. What is the acceptable way (my discipline is Social Science) to transform the different scores for each person for each questionnaire into a single score (is this what I have to do?)

    2. For my second study, the questionnaires were responded to first before participants were divided into different treatments of my IV ( x4 levels) followed by responding to two more questionnaires (my DVs). I predicted that high scores on the initial questionnaires will predict high scores on the DVs if IV level = W & X, and the inverse if IV level = Y & Z.
    Your response will be greatly appreciated. I am a self-funded doctorate student, a bit isolated, and statistics is not my adviser’s strongest point.

    Thanks,
    Ian Potter

  14. Pisie says

    July 19, 2014 at 9:36 pm

    Hi, I like this post. I have a question: my data set has only one numerical variable (eggs per gram of feces) and the others are categorical (sex, age category, etc.). Can I fit a GLMM with this data set?

  15. Loreth says

    June 2, 2014 at 7:03 pm

    Hello,
    My response variable is minutes spent feeding, which looks skewed to the left. Someone suggested using a Gamma distribution, but I am not too sure. Any suggestions? To put it into context, I want to know whether the diameter of a tree (DBH), tree species, and/or season has an effect on how long an animal spends feeding on a specific tree. Thanks!

  16. Endriyas says

    May 1, 2014 at 9:07 am

    Hi karen,
    I would like to know how likert scale can be changed to logistic regression?

    • Karen says

      May 6, 2014 at 3:06 pm

      Hi Endriyas,

      The scale wouldn’t be changed (usually). You just need ordinal logistic regression. I would start with this webinar on logistic regression. It’s a free download.

  17. Laura says

    March 10, 2014 at 2:40 pm

    Hi Karen,
    Thanks for the interesting post. I have proportional data that is zero-inflated: the proportion of a carnivore’s diet consisting of small mammals, with lots of results recording no small mammals at all.
    Would this be a zero-inflated tobit regression? Is there such a thing??
    Thanks
    Laura

    • Karen says

      March 10, 2014 at 4:57 pm

      Hi Laura,

      Sounds like it. I haven’t heard specifically of a zero-inflated tobit model, but it could exist. Zero inflated models are part of a class of models called mixture models, which combine two models.

  18. Cory says

    July 29, 2013 at 2:55 pm

    I find this post very confusing – GLM typically stands for generalized linear models which were formulated as a way of unifying various other statistical models, including linear regression, logistic regression and Poisson regression.

    • Karen says

      August 7, 2013 at 3:35 pm

      Hi Cory, it’s one of those confusing terms. GLM also stands for General Linear Model, which is what I meant here.

  19. Stan says

    May 29, 2013 at 4:32 am

    Karen, thanks for the information you give here. It’s very helpful.
    I have a question. Is it correct to use a generalized linear mixed model when my data are percentages? Thank you.

    Stan

    • Karen says

      June 6, 2013 at 5:25 pm

      It could be. Usually percentages (proportions, actually) have to be treated as either binomial or possibly beta distributed (although for a beta distribution, there can’t be any proportions of exactly 0 or 1). Both would work in a GLMM.

  20. Bereket.Y says

    February 14, 2013 at 5:54 am

    Can I use Poisson or another count data model if the dependent variable takes only the values 0, 1, 2, 3, and 4?

    • Karen says

      February 20, 2013 at 4:55 pm

      Potentially. Is it truly a count? You can always run it, then check assumptions. If they’re met, then sure.

      Karen
