Linear Regression

Articles at The Analysis Factor

  • 3 Mistakes Data Analysts Make in Testing Assumptions in GLM - I know you know it--those assumptions in your regression or ANOVA model really are important. If they're not met adequately, all your p-values are inaccurate, wrong, useless. But, and this is a big one, the GLM is robust to departures from those assumptions. Meaning, they don't have to be met exactly for your results to be accurate, right, useful.
  • 3 Reasons Psychology Researchers should Learn Regression - Why should you, as a researcher in Psychology, Education, or Agriculture, who is trained in ANOVA, need to learn linear regression? There are 3 main reasons.
  • 6 Types of Dependent Variables that will Never Meet the GLM Normality Assumption - Sometimes it's because the dependent variable just isn't appropriate for a GLM. The dependent variable, Y, doesn't have to be normal for the residuals to be normal (since Y is affected by the X's). But Y does have to be continuous, unbounded, and measured on an interval or ratio scale.
  • 7 Practical Guidelines for Accurate Statistical Model Building - But if the point is to answer a research question that describes relationships, you're going to have to get your hands dirty. It's easy to say "use theory" or "test your research question" but that ignores a lot of practical issues. Like the fact that you may have 10 different variables that all measure the same theoretical construct, and it's not clear which one to use.
  • A Primer in Matrix Algebra for Data Analysts Webinar - At the center of multivariate statistical methods is the simultaneous consideration of multiple variables and the inherent complexity it introduces. Matrix/Linear algebra is a mathematical method particularly well-suited to dealing with multiple variables...
  • About Dummy Variables in SPSS Analysis - I know that if I included 5 dummy location variables (6 locations in total, with A as the reference group) in 1 block of the regression analysis, the result would be based on the comparison with the reference location. Then what if I put 6 dummies (for example, the 1st dummy would be "1" for A location, and "0" for otherwise) in 1 block? Will it be a bug? If not, how to interpret the result?
  • Anatomy of a Normal Probability Plot - Across the bottom are the observed data values, sorted lowest to highest. You can see that just like on the histogram, the values range from about -2.2 to 2.2. (Note, these are standardized residuals, so they already have a mean of 0 and a standard deviation of 1. If they didn’t, the plot would standardize them before plotting). A quick way to draw one is sketched after this list.
  • ANCOVA Assumptions: When Slopes are Unequal - Of course, the main effect for condition in this full model with the interaction will test the same thing, as well as give you additional information at different ages. So your second option is:
  • Announcing The Analysis Institute, a new Statistical Resource - Ta-Da! I am happy to announce the unveiling of our newest web site....The Analysis Institute! My team and I have been working hard all summer to bring you this latest opportunity. At The Analysis Institute, you'll find our newest applied statistics training programs, including two brand new fall workshops, some ebooks, and home study workshops.
  • Answers to the Interpreting Regression Coefficients Quiz - Yesterday I gave a little quiz about interpreting regression coefficients.  Today I’m giving you the answers. If you want to try it yourself before you see the answers, go here.  (It’s truly little, but if you’re like me, you just cannot resist testing yourself). True or False? 1. When you add an interaction to a […]
  • Assessing the Fit of Regression Models - A well-fitting regression model results in predicted values close to the observed data values. The mean model, which uses the mean for every predicted value, generally would be used if there were no informative predictor variables. The fit of a proposed regression model should therefore be better than the fit of the mean model. Three […]
  • Assumptions of Linear Models are about Residuals, not the Response Variable - I recently received a great question in a comment about whether the assumptions of normality, constant variance, and independence in linear models are about the residuals or the response variable. The asker had a situation where Y, the response, was not normally distributed, but the residuals were.
  • Beyond Median Splits: Meaningful Cut Points - It's true that median splits are arbitrary, but there are situations when it is reasonable to make a continuous numerical variable categorical.
  • Can a Regression Model with a Small R-squared Be Useful? - R² is such a lovely statistic, isn't it? Unlike so many of the others, it makes sense--the percentage of variance in Y accounted for by a model. I mean, you can actually understand that. So can your grandmother. And the clinical audience you're writing the report for. A big R² is always big (and good!) and a small one is always small (and bad!), right? Well, maybe.
  • Can Likert Scale Data ever be Continuous? - A very common question is whether it is legitimate to use Likert scale data in parametric statistical procedures that require interval data, such as Linear Regression, ANOVA, and Factor Analysis. A typical Likert scale item has 5 to 11 points that indicate the degree of agreement with a statement, such as 1=Strongly Agree to 5=Strongly […]
  • Centering and Standardizing Predictors - I was recently asked about whether centering (subtracting the mean) a predictor variable in a regression model has the same effect as standardizing (converting it to a Z score).  My response: They are similar, but not the same (see the sketch after this list). In centering, you are changing the values, but not the scale.  So a predictor that is centered […]
  • Centering for Multicollinearity Between Main effects and Quadratic terms - One of the most common causes of multicollinearity is when predictor variables are multiplied to create an interaction term, or quadratic or higher order terms (X squared, X cubed, etc.). Why does this happen?  When all the X values are positive, higher values produce high products and lower values produce low products (see the sketch after this list).  So the […]
  • Checking Assumptions in ANOVA and Linear Regression Models: The Distribution of Dependent Variables - The distributional assumptions for linear regression and ANOVA are for the distribution of Y|X -- that's Y given X. You have to take out the effects of all the Xs before you look at the distribution of Y.
  • Checking the Normality Assumption for an ANOVA Model - The assumptions are exactly the same for ANOVA and regression models. The normality assumption is that residuals follow a normal distribution. You usually see it like this: ε ~ i.i.d. N(0, σ²). But what it's really getting at is the distribution of Y|X.
  • Clarifications on Interpreting Interactions in Regression - In a previous post, Interpreting Interactions in Regression, I said the following: In our example, once we add the interaction term, our model looks like: Height = 35 + 4.2*Bacteria + 9*Sun + 3.2*Bacteria*Sun (worked through in a sketch after this list). Adding the interaction term changed the values of B1 and B2. The effect of Bacteria on Height is now 4.2 + […]
  • Concepts in Linear Regression you need to know before learning Multilevel Models - Learning Multilevel Models is hard enough. Even worse if you don't have a good background in linear regression. This article outlines 4 must-know concepts.
  • Confusing Statistical Term #4: Hierarchical Regression vs. Hierarchical Model - This one is relatively simple.  Very similar names for two totally different concepts. Hierarchical Models (aka Hierarchical Linear Models or HLM) are a type of linear regression models in which the observations fall into hierarchical, or completely nested levels. Hierarchical Models are a type of Multilevel Models. So what is a hierarchical data structure, which […]
  • Confusing Statistical Term #7: GLM - Like some of the other terms in our list--level and beta--GLM has two different meanings. It's a little different than the others, though, because it's an abbreviation for two different terms: General Linear Model and Generalized Linear Model. It's extra confusing because their names are so similar on top of having the same abbreviation.
  • Confusing Statistical Terms #1: The Many Names of Independent Variables - Statistical models, such as general linear models (linear regression, ANOVA, mixed models) and generalized linear models (logistic, Poisson, proportional hazard regression, etc.) all have the same general form.  On the left side of the equation is one or more response variables, Y.  On the right hand side is one or more predictor variables, X,  and […]
  • Continuous and Categorical Variables: The Trouble with Median Splits - A Median Split is one method for turning a continuous variable into a categorical one.  Essentially, the idea is to find the median of the continuous variable.  Any value below the median is put in the category “Low” and every value above it is labeled “High.” This is a very common practice in many social […]
  • Correlated Predictors in Regression Models: What is Multicollinearity and How to Detect it Webinar - The next The Craft of Statistical Analysis Webinar is: Correlated Predictors in Regression Models: What is Multicollinearity and How to Detect it There’s nothing like multicollinearity to strike fear into the heart of any regression modeler. But true multicollinearity is pretty rare, and correlations among predictors are not good metrics of multicollinearity. Join us in […]
  • Dummy Coding in SPSS GLM–More on Fixed Factors, Covariates, and Reference Groups, Part 1 - So the question is what to do with your categorical variables. You have two choices, and each has advantages and disadvantages. The easiest is to put categorical variables in Fixed Factors. SPSS will dummy code those variables for you, which is quite convenient if your categorical variable has more than two categories. However, there are some defaults you need to be aware of that may or may not make this a good choice. SPSS always makes the reference group the one that comes last alphabetically. So if the values you input are strings, it will be the one that comes last. If those values are numbers, it will be the highest one.
  • Dummy Coding in SPSS GLM–More on Fixed Factors, Covariates, and Reference Groups, Part 2 - Yesterday’s post outlined one issue in deciding whether to put a categorical predictor variable into Fixed Factors or Covariates in SPSS GLM.  That issue dealt with how SPSS automatically creates dummy variables out of any variable in Fixed Factors. Another default to keep in mind is that SPSS will automatically create interactions between any and […]
  • Five Common Relationships Among Three Variables in a Statistical Model - Including Z in the model often leads to the relationship between X and Y becoming more significant because Z has explained some of the otherwise unexplained variance in Y. An example of this kind of covariate is when an experimental manipulation (X) on response time (Y) only becomes significant when we control for finger dexterity levels (Z).
  • Get Started with SPSS: Tutorial Videos - If you’re just getting started using SPSS, here’s a nice series of SPSS video tutorials, created by Dr. Ian Walker at the University of Bath. They cover many of the basics: histograms, Two-sample t-tests, Mann Whitney U tests, one-way anova, regression, etc. They’re nice because not only does he show you how to do them […]
  • GLM in SPSS: Centering a Covariate to Improve Interpretability - The reason for centering a continuous covariate is that it can improve interpretability. For example, say you had one categorical predictor with 4 categories and one continuous covariate, plus an interaction between them. First, you’ll notice that if you center your covariate at the mean, there is […]
  • Have you Wondered how using SPSS Burns Calories? - Number 4: This morning, I received an email listing some interesting facts, among them: "Banging your head against a wall burns 150 calories an hour." I'm pretty sure that one is not specifically about SPSS, but it could be.
  • Help me plan my spring statistics workshops - Can I ask you a favor? I am planning our spring statistics workshops.  As always, we’re getting creative to find ways to bring you the statistical support you need easily and efficiently. I found a great service that will allow me to do workshops via webcast, so you can participate from home or office–no travel […]
  • How Simple Should a Model Be? The Case of Insignificant Controls, Interactions, and Covariance Structures - “Everything should be made as simple as possible, but no simpler” – Albert Einstein* For some reason, I’ve heard this quotation 3 times in the past 3 days.  Maybe I hear it everyday, but only noticed because I’ve been working with a few clients on model selection, and deciding how much to simplify a model. […]
  • How to Combine Complicated Models with Tricky Effects - You're dealing with both a complicated modeling technique (survival analysis, logistic regression, multilevel modeling) and tricky effects in the model (dummy coding, interactions, and quadratic terms). The only way to figure it all out in a situation like that is to break it down into parts. Trying to understand all those complicated parts together is a recipe for disaster. But if you can do linear regression, each part is just one step up in complexity. Take one step at a time.
  • How to Get Standardized Regression Coefficients When Your Software Doesn’t Want To Give Them To You - Remember all those Z-scores you had to calculate in Intro Stats? Converting a variable to a Z-score is standardizing. In other words, do these steps for Y, your outcome variable, and every X, your predictors (the full recipe is sketched in code after this list): 1. Calculate the mean and standard deviation.
  • How to Interpret the Intercept in 6 Linear Regression Examples - In all linear regression models, the intercept has the same definition: the mean of the response, Y, when all predictors, all the Xs, equal 0.
  • Interpreting (Even Tricky) Regression Coefficients – A Quiz - Here’s a little quiz: True or False? 1. When you add an interaction to a regression model, you can still evaluate the main effects of the terms that make up the interaction, just like in ANOVA. 2. The intercept is usually meaningless in a regression model.
  • Interpreting Interactions Between Two Effect-Coded Categorical Predictors - I recently received this great question: Question: Hi Karen, I've purchased a lot of your material and read a lot of your pdf documents w.r.t. regression and interaction terms. It's now my general understanding that interaction for two or more categorical variables is best done with effects coding, and interactions between continuous and categorical variables is […]
  • Interpreting Interactions in Regression - Adding interaction terms to a regression model can greatly expand understanding of the relationships among the variables in the model and allows more hypotheses to be tested. The example from Interpreting Regression Coefficients was a model of the height of a shrub (Height) based on the amount of bacteria in the soil (Bacteria) and whether […]
  • Interpreting Lower Order Coefficients When the Model Contains an Interaction - A Linear Regression Model with an interaction between two predictors changes the meaning of the coefficients for the lower order terms--the predictors that are involved in the interaction. They need to be interpreted differently.
  • Interpreting Regression Coefficients - Linear regression is one of the most popular statistical techniques used by researchers. Despite its popularity, interpretation of the regression coefficients of any but the simplest models is sometimes difficult. This article explains how to interpret the coefficients of continuous and categorical variables.  Although the example used here is a linear regression model with two […]
  • Interpreting Regression Coefficients in Models other than Ordinary Linear Regression - So this is the actual model for an ordinary least squares linear regression. The left hand side of the equation is just Y and ε, the error term, has a normal distribution. For other types of regression models, like logistic regression, Poisson regression, or multilevel models, all the βs and Xs stay the same. The only parts that can differ: 1. Instead of Y on the left, there can be a function of Y--a non-linear transformation. 2. Instead of a normal distribution, the residuals can have another distribution.
  • Interpreting Regression Coefficients Teleseminar is Wednesday - Just a reminder that our free monthly teleseminar is tomorrow.  The topic this month is Interpreting Regression Coefficients: A Walk Through Output. Always free, but you have to register… http://www.theanalysisfactor.com/learning/teletraining4.html. This month we’ll be doing two things differently–we’re trying out a new webinar system, adding visuals; and we’re going over the actual output of one […]
  • Interpreting Regression Coefficients: Changing the scale of predictor variables - Sometimes it makes sense to change the scale of predictor variables so that interpretations of parameter estimates, including odds ratios, make sense.  It is generally done by multiplying the values of a predictor by a constant, often a factor of 10. Since parameter estimates and odds ratios tell you the effect of a one unit […]
  • Interpreting the Intercept in a Regression Model - How do you interpret the intercept in a regression model? The intercept is the expected mean value of Y when all X=0. This has different meanings, depending on the scale of X.
  • Is Multicollinearity the Bogeyman? - Multicollinearity occurs when two or more predictor variables in a regression model are redundant. It is a real problem, and it can do terrible things to your results. But it is uncommon, and is often misdiagnosed.
  • Likert Scale Items as Predictor Variables in Regression - I was recently asked about whether it's okay to treat a Likert scale as continuous in a regression model. Here's my reply.
  • Linear Regression Analysis – 3 Common Causes of Multicollinearity and What to Do About Them - Multicollinearity is simply redundancy in the information contained in predictor variables. If the redundancy is moderate, it usually only affects the interpretation of regression coefficients. But if it is severe (at or near perfect redundancy), it causes the model to "blow up." (And yes, that's a technical term). But the reality is that there are only five situations where it commonly occurs. And three of them have very simple solutions.
  • Making Dummy Codes Easy to Keep Track of - Here’s a little tip. When you construct Dummy Variables, make it easy on yourself  to remember which code is which.  Heck, if you want to be really nice, make it easy for anyone else who will analyze the data or read the results. Make the codes inherent in the Dummy variable name. So instead of […]
  • May 2015 Membership Webinar: Transformations & Nonlinear Effects in Linear Models - Why is it we can model non-linear effects in linear regression? What the heck does it mean for a model to be “linear in the parameters?” In this webinar we will explore a number of ways of using a linear regression to model a non-linear effect between X and Y.
  • Multiple Regression Model: Univariate or Multivariate GLM? - A regression analysis with one dependent variable and 8 independent variables is NOT a multivariate regression. It's a multiple regression. Multivariate analysis ALWAYS refers to the dependent variable.
  • Nate Silver is Making Statistics Cool - I just heard about FiveThirtyEight.com (from a friend who barely understands a word of it). It’s a blog in which Nate Silver has been projecting election results using statistical projections and an incredibly thorough use of polling data, both for current and historical elections.  Now that it’s mostly over, the amazing accuracy of his predictions […]
  • One-tailed and two-tailed tests - I was recently asked about when to use one and two tailed tests. The long answer is:  Use one tailed tests when you have a specific hypothesis about the direction of your relationship.  Some examples include you hypothesize that one group mean is larger than the other; you hypothesize that the correlation is positive; you […]
  • Online Statistics Workshop: Running Regressions and ANOVAs in SPSS GLM is now available on demand - The focus of this workshop is learning all the ins and outs of SPSS GLM. It's about regression, yes, and ANCOVA, yes. But mostly it's about SPSS GLM. It's about mastering the tool--the statistical software--so you can do the analysis without second guessing yourself.
  • Poisson Regression Analysis for Count Data - There are many dependent variables that, no matter how many transformations you try, you cannot get to be normally distributed.  The most common culprits are count variables–the variable measures the count or rate of some event in a sample (see the sketch after this list).  Some examples I’ve seen from a variety of disciplines are: Number of eggs in a […]
  • Problems Caused by Categorizing Continuous Variables - I just came across this great article by Frank Harrell:  Problems Caused by Categorizing Continuous Variables It’s from the Vanderbilt University biostatistics department, so the examples are all medical, but the points hold for any field. It goes right along with my recent post, Continuous and Categorical Variables: The Trouble with Median Splits.
  • Proportions as Dependent Variable in Regression–Which Type of Model? - Proportions as dependent variables can be tricky. You can run a linear regression model, a logistic regression model, or a tobit model, depending on your data and variables.
  • Regression Diagnostics: Resources for Multicollinearity - In preparing for my teletraining this week, I’ve been researching some good resources on multicollinearity.  In only an hour, I can’t go into as much detail as these resources contain (we have a lot to cover!). But in case you miss the call or just can’t get enough, these are some great resources: Regression Diagnostics: […]
  • Regression models without intercepts - A recent question on the Talkstats forum asked about dropping the intercept in a linear regression model, since it makes the predictor’s coefficient stronger and more significant.  Dropping the intercept in a regression model forces the regression line to go through the origin–the y intercept must be 0. The problem with dropping the intercept is […]
  • Regression Models: How do you know you need a polynomial? - A polynomial term–a quadratic (squared) or cubic (cubed) term–turns a linear regression model into a curve.  But because it is X that is squared or cubed, not the Beta coefficient, it still qualifies as a linear model.  This makes it a nice, straightforward way to model curves without having to model complicated non-linear models. […]
  • Regression Through the Origin - I just wanted to follow up on my last post about Regression without Intercepts. Regression through the Origin means that you purposely drop the intercept from the model: when X = 0, Y must equal 0 (see the sketch after this list).  The thing to be careful about in choosing any regression model is that it fit the data well.  Pretty much the […]
  • Should You Always Center a Predictor on the Mean? - One problem is that the mean age at which infants utter their first word may differ from one sample to another. This means you're not always evaluating that mean at the exact same age. It's not comparable across samples. So another option is to choose a meaningful value of age that is within the values in the data set. One example may be at 12 months.
  • SPSS GLM: Choosing Fixed Factors and Covariates - The beauty of the Univariate GLM procedure in SPSS is that it is so flexible.  You can use it to analyze regressions, ANOVAs, ANCOVAs with all sorts of interactions, dummy coding, etc. The down side of this flexibility is it is often confusing what to put where and what it all means. So here’s a […]
  • Testing and Dropping Interaction Terms in Regression and ANOVA models - In an ANOVA or regression model, should you drop interaction terms if they're not significant? As with everything in statistics, it depends.
  • The 13 Steps for Statistical Modeling in any Regression or ANOVA - No matter what statistical model you’re running, you need to go through the same 13 steps.  The order and the specifics of how you do each step will differ depending on the data and the type of model you use. These 13 steps are in 3 major parts.  Most people think of only Part 3 […]
  • The Assumptions of Linear Models: Explicit and Implicit - These assumptions are explicitly stated by the model: (1) the residuals are independent; (2) the residuals are normally distributed; (3) the residuals have a mean of 0 at all values of X; (4) the residuals have constant variance.
  • The Difference Between Interaction and Association - Interaction is different. Whether two variables are associated says nothing about whether they interact in their effect on a third variable. Likewise, if two variables interact, they may or may not be associated.
  • The Distribution of Independent Variables in Regression Models - While there are a number of distributional assumptions in regression models, one distribution that has no assumptions is that of any predictor (i.e. independent) variables. It’s because regression models are directional. In a correlation, there is no direction–Y and X are interchangeable. If you switched them, you’d get the same correlation coefficient. But regression is […]
  • The Exposure Variable in Poisson Regression Models - Poisson Regression Models and their extensions (Zero-Inflated Poisson, Negative Binomial Regression, etc.) are used to model counts and rates (see the exposure sketch after this list). A few examples of count variables include: – Number of words an eighteen month old can say – Number of aggressive incidents performed by patients in an inpatient rehab center. Most count variables follow one of […]
  • The General Linear Model, Analysis of Covariance, and How ANOVA and Linear Regression Really are the Same Model Wearing Different Clothes - But that's really just one application of a linear model with one categorical and one continuous predictor. The research question of interest doesn't have to be about the categorical predictor, and the covariate doesn't have to be a nuisance variable. A regression model with one continuous and one dummy variable is the same model (actually, you'd need two dummy variables to cover the three categories, but that's another story).
  • The Impact of Removing the Constant from a Regression Model: The Categorical Case - In a simple linear regression model how the constant (aka, intercept) is interpreted depends upon the type of predictor (independent) variable. If the predictor is categorical and dummy-coded, the constant is the mean value of the outcome variable for the reference category only. If the predictor variable is continuous, the constant equals the predicted value of the outcome variable when the predictor variable equals zero.
  • Understanding Interaction Between Dummy Coded Categorical Variables in Linear Regression - The concept of a statistical interaction is one of those things that seems very abstract. If you’re like me, you’re wondering: What in the world is meant by “the relationship among three or more variables”?
  • Understanding Mediation and Path Analysis - The Next The Craft of Statistical Analysis Webinar* is tomorrow: Understanding Mediation and Path Analysis Path Analysis is a system of regression equations used to determine if a third variable (a mediator) is driving the relationship between an independent and dependent variable. It is one of the simplest forms of structural equation models (SEM), but […]
  • Using Adjusted Means to Interpret Moderators in Analysis of Covariance - If you use the menus in SPSS, you can only get those EMMeans at the Covariate's mean, which in this example is about 25, where the vertical black line is. This isn't very useful for our purposes. But we can change the value of the covariate at which to compare the means using syntax.
  • When Assumptions of ANCOVA are Irrelevant - Every once in a while, I work with a client who is stuck between a particular statistical rock and hard place. It happens when they're trying to run an analysis of covariance (ANCOVA) model because they have a categorical independent variable and a continuous covariate. The problem arises when a coauthor, committee member, or reviewer insists that ANCOVA is inappropriate in this situation because one of the following ANCOVA assumptions is not met: (1) the independent variable and the covariate are independent of each other; (2) there is no interaction between the independent variable and the covariate.
  • When Dependent Variables Are Not Fit for Linear Models, Now What? - When your dependent variable is not continuous, unbounded, and measured on an interval or ratio scale, your model will never meet the Assumptions of the General Linear Model (GLM).  Today I’m going to go into more detail about these 6 common types of dependent variables, and the tests that work instead. Categorical Variables, including both […]
  • When Dummy Codes are Backwards, Your Stat Software may be Messing With You - In SAS proc glm, when you specify a predictor as categorical in the CLASS statement, it will automatically dummy code it for you in the parameter estimates table (the regression coefficients). The default reference category--what GLM will code as 0--is the highest value. This works just fine if your values are coded 1, 2, and 3. But if you've dummy coded them already, it's switching them on you. (A sketch of setting the reference category explicitly follows this list.)
  • When NOT to Center a Predictor Variable in Regression - Centering is often used to improve the interpretability of regression coefficients. When should a data analyst not center? This article gives 3 necessary conditions.
  • When to Check Model Assumptions - If any of these fail, it’s nearly impossible to get normally distributed residuals, even with remedial transformations. Types of variables that will generally fail these criteria include: Categorical Variables, both nominal and ordinal. Count Variables, which are often distributed as Poisson or Negative Binomial.
  • Why ANOVA and Linear Regression are the Same Analysis - ANOVA and Linear Regression are not only related, they're the same thing. Not a quarter and a nickel--different sides of the same coin. This article shows why.
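
Code Sketches

A few short sketches follow, illustrating points flagged in the list above. They are minimal Python illustrations, not code from the articles: all data is simulated and every variable name is hypothetical.

For Anatomy of a Normal Probability Plot: a minimal way to draw the plot the article describes, assuming scipy and matplotlib are available. The residuals here are simulated stand-ins for standardized residuals from a fitted model.

    # Normal probability (Q-Q) plot of residuals: observed values sorted
    # lowest to highest, plotted against the quantiles expected under normality.
    import numpy as np
    import scipy.stats as stats
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(42)
    residuals = rng.standard_normal(200)  # stand-in for standardized residuals

    stats.probplot(residuals, dist="norm", plot=plt)
    plt.title("Normal probability plot of residuals")
    plt.show()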
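For Centering and Standardizing Predictors: a sketch of the difference, assuming numpy; the predictor is simulated. Centering shifts the values but leaves the scale alone; standardizing changes both.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(loc=50, scale=10, size=1000)  # hypothetical predictor

    centered = x - x.mean()               # changes the values, not the scale
    z_scored = (x - x.mean()) / x.std()   # changes the values and the scale

    print(f"centered: mean={centered.mean():.2f}, sd={centered.std():.2f}")  # ~0, ~10
    print(f"z-scored: mean={z_scored.mean():.2f}, sd={z_scored.std():.2f}")  # ~0, ~1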
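For Centering for Multicollinearity Between Main Effects and Quadratic Terms: a sketch, with simulated numbers, of why all-positive X values make X and X squared nearly collinear, and how centering breaks that artificial correlation.

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(10, 20, size=500)    # all X values positive

    print(np.corrcoef(x, x**2)[0, 1])    # near 1: X and X**2 nearly collinear
    xc = x - x.mean()                    # center X first
    print(np.corrcoef(xc, xc**2)[0, 1])  # near 0: the collinearity is gone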
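For Clarifications on Interpreting Interactions in Regression: the article's fitted equation, Height = 35 + 4.2*Bacteria + 9*Sun + 3.2*Bacteria*Sun, worked through in code. The 0/1 coding of Sun is an assumption made here for illustration.

    # With the interaction, the slope for Bacteria is 4.2 + 3.2*Sun:
    def height(bacteria, sun):
        return 35 + 4.2 * bacteria + 9 * sun + 3.2 * bacteria * sun

    for sun in (0, 1):  # assuming Sun is coded 0/1
        slope = height(1, sun) - height(0, sun)  # effect of one unit of Bacteria
        print(f"Sun={sun}: effect of Bacteria = {slope:.1f}")
    # prints 4.2 when Sun=0 and 7.4 when Sun=1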
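For How to Get Standardized Regression Coefficients When Your Software Doesn’t Want To Give Them To You: the recipe as a sketch, assuming statsmodels and simulated data. Z-score the outcome and every predictor, refit, and the slopes from the refit are the standardized coefficients.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(3)
    df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
    df["y"] = 2 * df.x1 + 0.5 * df.x2 + rng.normal(size=200)

    z = (df - df.mean()) / df.std()             # z-score Y and every X
    fit = smf.ols("y ~ x1 + x2", data=z).fit()  # refit on the z-scored data
    print(fit.params)                           # standardized coefficients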
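For Poisson Regression Analysis for Count Data: a sketch of modeling a count outcome directly with a Poisson GLM instead of transforming it, assuming statsmodels; the egg-count data is simulated.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(4)
    df = pd.DataFrame({"x": rng.normal(size=300)})
    df["num_eggs"] = rng.poisson(np.exp(0.3 + 0.7 * df.x))  # a count outcome

    fit = smf.glm("num_eggs ~ x", data=df, family=sm.families.Poisson()).fit()
    print(fit.params)  # coefficients are on the log-count scale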
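For Regression Through the Origin: a sketch comparing a model with an intercept to one forced through the origin, assuming statsmodels, where "- 1" in a formula drops the intercept. Data is simulated with a true intercept of 3, so the no-intercept fit is visibly distorted.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(5)
    df = pd.DataFrame({"x": rng.uniform(0, 10, size=100)})
    df["y"] = 3 + 2 * df.x + rng.normal(size=100)  # true intercept is 3

    with_int = smf.ols("y ~ x", data=df).fit()
    no_int = smf.ols("y ~ x - 1", data=df).fit()   # forced through (0, 0)
    print(with_int.params)  # intercept near 3, slope near 2
    print(no_int.params)    # slope inflated to compensate for the lost intercept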
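For The Exposure Variable in Poisson Regression Models: a sketch of modeling a rate when subjects are observed for different lengths of time, using statsmodels' exposure argument (equivalent to an offset of log(exposure)). All names and numbers are made up.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(6)
    df = pd.DataFrame({"x": rng.normal(size=300),
                       "days": rng.integers(5, 60, size=300)})  # time observed
    df["events"] = rng.poisson(df.days * np.exp(-2 + 0.5 * df.x))

    fit = smf.glm("events ~ x", data=df, family=sm.families.Poisson(),
                  exposure=df.days).fit()
    print(fit.params)  # coefficients now describe the event rate per day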
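For When Dummy Codes are Backwards, Your Stat Software may be Messing With You: a sketch of setting the reference category explicitly rather than trusting the software default, using patsy's Treatment coding through statsmodels. The groups and means are invented.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(7)
    df = pd.DataFrame({"group": rng.choice(["a", "b", "c"], size=300)})
    df["y"] = df.group.map({"a": 10, "b": 12, "c": 15}) + rng.normal(size=300)

    # Treatment (dummy) coding with an explicit reference level, here "c":
    fit = smf.ols('y ~ C(group, Treatment(reference="c"))', data=df).fit()
    print(fit.params)  # intercept = mean of group c; slopes = differences from c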