The Analysis Factor

Statistical Consulting, Resources, and Statistics Workshops for Researchers


Posts tagged: linear regression

Proportions as Dependent Variable in Regression–Which Type of Model?

by Karen Grace-Martin 13 Comments

When the dependent variable in a regression model is a proportion or a percentage, it can be tricky to decide on the appropriate way to model it.

The big problem with ordinary linear regression is that the model can predict impossible values–values below 0 or above 1. The other problem is that the relationship isn't linear–it's sigmoidal. A sigmoidal curve looks like a flattened S: linear in the middle, but flattened on the ends. So now what?

The simplest approach is to do a linear regression anyway.  This approach can be justified only in a few situations.

1. All your data fall in the middle, linear section of the curve. This generally translates to all your data being between .2 and .8 (although I've heard that between .3 and .7 is safer). If this holds, you don't have to worry about either objection: you do have a linear relationship, and you won't get predicted values much beyond those values–certainly not below 0 or above 1.

2. You have a really complicated model that would be much harder to fit any other way. If you can assume a linear model, it will be much easier to fit, say, a complicated mixed model or a structural equation model. If it's just a single multiple regression, however, you should look into one of the other methods.
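
If you want to verify that your data qualify for situation #1, the range check is easy to script. Here is a minimal sketch in Python with statsmodels; the data and variable names (`prop`, `x`) are hypothetical:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: 'prop' is the proportion outcome, 'x' a predictor
rng = np.random.default_rng(42)
df = pd.DataFrame({"x": rng.uniform(0, 10, 100)})
df["prop"] = np.clip(0.4 + 0.02 * df["x"] + rng.normal(0, 0.04, 100), 0, 1)

# Check that the data sit in the middle, linear part of the sigmoid
print(df["prop"].min(), df["prop"].max())
assert df["prop"].between(0.2, 0.8).all(), "OLS on the raw proportion is risky here"

# If the check passes, an ordinary linear regression is defensible
fit = smf.ols("prop ~ x", data=df).fit()
print(fit.summary())
```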

A second approach is to treat the proportion as a binary response and run a logistic or probit regression. This will only work if the proportion can be thought of as a number of successes out of a number of trials, and you have the data for both the successes and the total. For example, the proportion of land area covered with a certain species of plant would be hard to think of this way, but the proportion of correct answers on a 20-item assessment fits naturally.
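
Here is a minimal sketch of this second approach using Python's statsmodels, assuming a hypothetical assessment dataset where each row records correct answers out of 20 items. The `successes + failures ~ predictors` formula is how statsmodels specifies a binomial response with trial counts:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical data: 'correct' out of 'total' items per student, one predictor
rng = np.random.default_rng(1)
n = 200
study_hours = rng.uniform(0, 10, n)
p = 1 / (1 + np.exp(-(-1.0 + 0.3 * study_hours)))  # true success probability
df = pd.DataFrame({
    "study_hours": study_hours,
    "correct": rng.binomial(20, p),
    "total": 20,
})
df["failures"] = df["total"] - df["correct"]

# Logistic regression on successes/failures rather than a 0/1 outcome
fit = smf.glm("correct + failures ~ study_hours", data=df,
              family=sm.families.Binomial()).fit()
print(fit.summary())
```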

The third approach is to treat the proportion as a censored continuous variable. The censoring means that you don't have information below 0 or above 1. For example, perhaps the plant would have spread even more if it hadn't run out of land. If you take this approach, you would run the model as a two-limit tobit model (Long, 1997). This approach works best if there isn't an excessive amount of censoring (values of exactly 0 or 1).
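
Common packages don't always ship a two-limit tobit (statsmodels, for instance, has no built-in one), but the log-likelihood is simple enough to maximize directly. The sketch below, with simulated data and limits at 0 and 1, is one way to do it; all names are illustrative:

```python
import numpy as np
from scipy import optimize, stats

def two_limit_tobit_negll(params, y, X, lower=0.0, upper=1.0):
    """Negative log-likelihood for a tobit model censored at both limits."""
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)          # keep sigma positive
    xb = X @ beta
    at_lower = y <= lower
    at_upper = y >= upper
    interior = ~(at_lower | at_upper)
    ll = np.zeros_like(y, dtype=float)
    # Censored at the lower limit: P(latent y* <= lower)
    ll[at_lower] = stats.norm.logcdf((lower - xb[at_lower]) / sigma)
    # Censored at the upper limit: P(latent y* >= upper)
    ll[at_upper] = stats.norm.logsf((upper - xb[at_upper]) / sigma)
    # Uncensored: normal density of the observed value
    ll[interior] = stats.norm.logpdf((y[interior] - xb[interior]) / sigma) - np.log(sigma)
    return -ll.sum()

# Hypothetical data: a latent linear outcome censored to [0, 1]
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = np.clip(X @ np.array([0.5, 0.3]) + rng.normal(0, 0.25, n), 0.0, 1.0)

start = np.zeros(X.shape[1] + 1)
res = optimize.minimize(two_limit_tobit_negll, start, args=(y, X), method="BFGS")
print("beta:", res.x[:-1], "sigma:", np.exp(res.x[-1]))
```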

Reference: Long, J. S. (1997). Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.

Tagged With: dependent variable, linear regression, logistic regression, percentage data, Proportion, Tobit Regression

Related Posts

  • Member Training: Types of Regression Models and When to Use Them
  • When Linear Models Don’t Fit Your Data, Now What?
  • When to Use Logistic Regression for Percentages and Counts
  • Member Training: Using Excel to Graph Predicted Values from Regression Models

Interpreting Interactions in Regression

by Karen Grace-Martin 31 Comments

Adding interaction terms to a regression model has real benefits. It greatly expands your understanding of the relationships among the variables in the model, and you can test more specific hypotheses. But interpreting interactions in regression requires understanding what each coefficient is telling you.

The example from Interpreting Regression Coefficients was a model of the height of a shrub (Height) based on the amount of bacteria in the soil (Bacteria) and whether the shrub is located in partial or full sun (Sun). Height is measured in cm, Bacteria is measured in thousand per ml of soil, and Sun = 0 if the plant is in partial sun, and Sun = 1 if the plant is in full sun.
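
As a sketch of what fitting such a model looks like in practice (simulated data with illustrative coefficients, not the original post's), the `Bacteria * Sun` formula shorthand expands to both main effects plus their interaction:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated shrub data matching the example's variables
rng = np.random.default_rng(7)
n = 100
df = pd.DataFrame({
    "Bacteria": rng.uniform(0, 10, n),   # thousands per ml of soil
    "Sun": rng.integers(0, 2, n),        # 0 = partial sun, 1 = full sun
})
df["Height"] = (42 + 2.3 * df["Bacteria"] + 11 * df["Sun"]
                + 1.2 * df["Bacteria"] * df["Sun"] + rng.normal(0, 3, n))

# 'Bacteria * Sun' expands to both main effects plus their interaction
fit = smf.ols("Height ~ Bacteria * Sun", data=df).fit()
print(fit.params)
# Bacteria's coefficient is now its slope when Sun = 0 (partial sun);
# the interaction term is how much that slope changes in full sun.
```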

[Read more…] about Interpreting Interactions in Regression

Tagged With: Interpreting Interactions, linear regression, regression coefficients

Related Posts

  • Using Marginal Means to Explain an Interaction to a Non-Statistical Audience
  • Understanding Interactions Between Categorical and Continuous Variables in Linear Regression
  • Clarifications on Interpreting Interactions in Regression
  • Interpreting Lower Order Coefficients When the Model Contains an Interaction

SPSS GLM: Choosing Fixed Factors and Covariates

by Karen Grace-Martin 87 Comments

The beauty of the Univariate GLM procedure in SPSS is that it is so flexible. You can use it to run regressions, ANOVAs, and ANCOVAs with all sorts of interactions, dummy coding, etc.

The downside of this flexibility is that it is often confusing what to put where and what it all means.

So here’s a quick breakdown.

The Dependent Variable box, I hope, is pretty straightforward: put in your continuous dependent variable.

Fixed Factors are categorical independent variables.  It does not matter if the variable is [Read more…] about SPSS GLM: Choosing Fixed Factors and Covariates
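
The same distinction exists outside SPSS. As an analogy only (not SPSS syntax), here is how a categorical fixed factor versus a continuous covariate would be declared in, say, Python's statsmodels formula interface, with hypothetical variable names:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: 'group' is categorical (a fixed factor),
# 'age' is continuous (a covariate), 'score' is the dependent variable
df = pd.DataFrame({
    "score": [12.1, 14.3, 9.8, 15.2, 11.0, 13.7],
    "group": ["a", "b", "a", "c", "b", "c"],
    "age":   [23, 31, 27, 45, 38, 29],
})

# C() tells the model to dummy-code 'group', as SPSS does for Fixed Factors;
# 'age' enters as-is, like a Covariate
fit = smf.ols("score ~ C(group) + age", data=df).fit()
print(fit.params)
```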

Tagged With: analysis of covariance, ancova, ANOVA, Covariate, dummy coding, Fixed Factor, linear regression, post hoc test, SPSS GLM

Related Posts

  • Dummy Coding in SPSS GLM–More on Fixed Factors, Covariates, and Reference Groups, Part 1
  • The General Linear Model, Analysis of Covariance, and How ANOVA and Linear Regression Really are the Same Model Wearing Different Clothes
  • Dummy Coding in SPSS GLM–More on Fixed Factors, Covariates, and Reference Groups, Part 2
  • Why ANOVA and Linear Regression are the Same Analysis

Centering for Multicollinearity Between Main Effects and Quadratic Terms

by Karen Grace-Martin 8 Comments

One of the most common causes of multicollinearity is multiplying predictor variables to create an interaction term or a quadratic or higher-order term (X squared, X cubed, etc.).

Why does this happen? When all the X values are positive, higher values produce high products and lower values produce low products, so the product variable is highly correlated with the component variable. (If the values are all on a negative scale, the same thing happens, but the correlation is negative.) A very simple example will clarify.

In a small sample, say you have the following values of a predictor variable X, sorted in ascending order:

2, 4, 4, 5, 6, 7, 7, 8, 8, 8

It is clear to you that the relationship between X and Y is not linear, but curved, so you add a quadratic term, X squared (X2), to the model.  The values of X squared are:

4, 16, 16, 25, 36, 49, 49, 64, 64, 64

The correlation between X and X2 is .987–almost perfect.

[Plot of X vs. X squared]

To remedy this, you simply center X at its mean. The mean of X is 5.9, so to center X, create a new variable XCen = X – 5.9.

These are the values of XCen:

-3.90, -1.90, -1.90, -.90, .10, 1.10, 1.10, 2.10, 2.10, 2.10

Now, the values of XCen squared are:

15.21, 3.61, 3.61, .81, .01, 1.21, 1.21, 4.41, 4.41, 4.41

The correlation between XCen and XCen2 is -.54–still not 0, but much more manageable, and definitely low enough to not cause severe multicollinearity. Centering works because the low end of the scale now has large absolute values, so its squares become large too–the squared variable no longer rises in lockstep with X.

The scatterplot between XCen and XCen2 is:

[Plot of centered X vs. centered X squared]

If the values of X had been symmetric around the mean rather than skewed, this would be a perfectly balanced parabola, and the correlation would be exactly 0.
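
You can reproduce both correlations in a few lines; here is a quick check in Python:

```python
import numpy as np

x = np.array([2, 4, 4, 5, 6, 7, 7, 8, 8, 8], dtype=float)

# Correlation between X and X squared: near-perfect
print(np.corrcoef(x, x**2)[0, 1])          # ~0.987

# Center X at its mean (5.9), then square the centered values
x_cen = x - x.mean()
print(np.corrcoef(x_cen, x_cen**2)[0, 1])  # ~-0.54
```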


Tagged With: centering, Correlation, linear regression, Multicollinearity

Related Posts

  • What is Multicollinearity? A Visual Description
  • Should You Always Center a Predictor on the Mean?
  • When NOT to Center a Predictor Variable in Regression
  • Using Marginal Means to Explain an Interaction to a Non-Statistical Audience

Regression Through the Origin

by Karen Grace-Martin 3 Comments

I just wanted to follow up on my last post about regression models without intercepts.

Regression through the origin means that you purposely drop the intercept from the model: when X = 0, Y must equal 0.

The thing to be careful about in choosing any regression model is that it fit the data well.  Pretty much the only time that a regression through the origin will fit better than a model with an intercept is if the point X=0, Y=0 is required by the data.

Yes, leaving out the intercept will increase your degrees of freedom by 1, since you're estimating one fewer parameter. But unless your sample size is really, really small, that won't matter. So it really has no advantages.

Tagged With: linear regression, Regression through the origin

Related Posts

  • Regression models without intercepts
  • The Difference Between R-squared and Adjusted R-squared
  • What is Multicollinearity? A Visual Description
  • Removing the Intercept from a Regression Model When X Is Continuous

Regression models without intercepts

by Karen Grace-Martin 8 Comments

A recent question on the Talkstats forum asked about dropping the intercept in a linear regression model, since doing so makes the predictor's coefficient stronger and more significant. Dropping the intercept in a regression model forces the regression line to go through the origin–the y-intercept must be 0.

The problem with dropping the intercept is that the slope may be steeper only because you're forcing the line through the origin, not because it fits the data better. If the intercept really should be something else, you're creating that steepness artificially. A more significant model isn't better if it's inaccurate.
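
A small simulation makes the danger concrete. In this sketch (simulated data; the true intercept is 5, not 0), forcing the line through the origin inflates the slope, and the reported R-squared isn't even comparable, since without a constant statsmodels, for example, reports the uncentered version:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data whose true intercept is clearly not 0
rng = np.random.default_rng(3)
x = rng.uniform(1, 10, 100)
y = 5.0 + 2.0 * x + rng.normal(0, 1, 100)

# With an intercept: slope estimates the true value (~2)
with_int = sm.OLS(y, sm.add_constant(x)).fit()
# Through the origin: the slope absorbs the missing intercept and inflates
no_int = sm.OLS(y, x).fit()

print(with_int.params)   # approx [5, 2]
print(no_int.params)     # slope noticeably steeper than 2
# Without a constant, statsmodels reports *uncentered* R-squared,
# which is typically much higher--"more significant" but misleading
print(with_int.rsquared, no_int.rsquared)
```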

Tagged With: linear regression, Regression through the origin

Related Posts

  • Regression Through the Origin
  • The Difference Between R-squared and Adjusted R-squared
  • What is Multicollinearity? A Visual Description
  • Removing the Intercept from a Regression Model When X Is Continuous

