3 Mistakes Data Analysts Make in Testing Assumptions in GLM-I know you know it--those assumptions in your regression or ANOVA model really are important. If they're not met adequately, all your p-values are inaccurate, wrong, useless.
But, and this is a big one, the GLM is robust to departures from those assumptions. Meaning, the assumptions don't have to be met exactly for the p-values to be accurate, right, useful.
7 Practical Guidelines for Accurate Statistical Model Building-But if the point is to answer a research question that describes relationships, you're going to have to get your hands dirty.
It's easy to say "use theory" or "test your research question" but that ignores a lot of practical issues. Like the fact that you may have 10 different variables that all measure the same theoretical construct, and it's not clear which one to use.
A Primer in Matrix Algebra for Data Analysts Webinar-At the center of multivariate statistical methods is the simultaneous consideration of multiple variables and the inherent complexity it introduces. Matrix/Linear algebra is a mathematical method particularly well-suited to dealing with multiple variables...
About Dummy Variables in SPSS Analysis-I know that if I included 5 dummy location variables (6 locations in total, with A as the reference group) in 1 block of the regression analysis, the result would be based on the comparison with the reference location. Then what if I put 6 dummies (for example, the 1st dummy would be "1" for A location, and "0" for otherwise) in 1 block? Will it be a bug? If not, how to interpret the result?
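One way to see what happens with six dummies is to look at the design matrix itself. The sketch below is only an illustration in Python, with made-up location labels and observations (none of this comes from the original question). It shows that when all six dummies are present, every row sums to 1, which exactly duplicates the intercept column. Dropping the dummy for A breaks that dependence.

```python
# Hypothetical location labels and observations, made up for illustration.
locations = ["A", "B", "C", "D", "E", "F"]
observations = ["A", "C", "F", "B", "B", "E", "D", "A"]


def dummy_row(value, levels):
    """Return one 0/1 indicator per level for a single observation."""
    return [1 if value == level else 0 for level in levels]


# Six dummies: one indicator per location, including A.
six_dummies = [dummy_row(obs, locations) for obs in observations]

# Five dummies: drop A so it serves as the reference group.
five_dummies = [dummy_row(obs, locations[1:]) for obs in observations]

# With all six dummies, each row sums to exactly 1, the same as the
# intercept column (a column of 1s). That is perfect collinearity.
six_row_sums = [sum(row) for row in six_dummies]

# With five dummies, rows for location A sum to 0, so the dummies no
# longer add up to the intercept column.
five_row_sums = [sum(row) for row in five_dummies]

print(six_row_sums)
print(five_row_sums)
```

How a given package reacts to the redundant column (dropping a variable, or refusing to estimate one coefficient) depends on the software, so check the output notes rather than assuming.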
Anatomy of a Normal Probability Plot-Across the bottom are the observed data values, sorted lowest to highest. You can see that just like on the histogram, the values range from about -2.2 to 2.2. (Note, these are standardized residuals, so they already have a mean of 0 and a standard deviation of 1. If they didn’t, the plot would standardize them before plotting).
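To make the two axes concrete, here is a minimal Python sketch of how the points on a plot like this can be built. The residual values are made up, and the plotting position (i - 0.5) / n is an assumption: it is one common choice, and statistical packages differ in the exact formula they use.

```python
from statistics import NormalDist

# Hypothetical standardized residuals, made up for illustration.
residuals = [0.3, -1.1, 2.2, -0.4, 0.9, -2.2, 0.1, 1.4, -0.7, 0.0]


def normal_plot_points(values):
    """Pair each sorted observed value with its expected normal quantile.

    Uses the plotting position (i - 0.5) / n, which is one common choice;
    statistical packages differ in the exact formula they use.
    """
    ordered = sorted(values)
    n = len(ordered)
    standard_normal = NormalDist(mu=0.0, sigma=1.0)
    expected = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(ordered, expected))


points = normal_plot_points(residuals)
for observed, expected in points:
    print(f"{observed:6.2f}  {expected:6.2f}")
```

If the residuals are roughly normal, the observed and expected columns rise together and the plotted points fall close to a straight line.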
ANCOVA Assumptions: When Slopes are Unequal-Of course, the main effect for condition in this full model with the interaction will test the same thing, as well as give you additional information at different ages. So your second option is:
Answers to the Interpreting Regression Coefficients Quiz-Yesterday I gave a little quiz about interpreting regression coefficients. Today I’m giving you the answers. If you want to try it yourself before you see the answers, go here. (It’s truly little, but if you’re like me, you just cannot resist testing yourself). True or False? 1. When you add an interaction to a […]
April 2017 Member Webinar: Segmented Regression-Linear regression with a continuous predictor is set up to measure the constant relationship between that predictor and a continuous outcome. This relationship is measured in the expected change in the outcome for each one-unit change in the predictor. One big assumption in this kind of model, though, is that this rate of change is the same for every value of the predictor. It's an assumption we need to question, though, because it's not a good approach for a lot of relationships. Segmented regression allows you to generate different slopes and/or intercepts for different segments of values of the continuous predictor. This can provide you with a wealth of information that a non-segmented regression cannot.
Assessing the Fit of Regression Models-A well-fitting regression model results in predicted values close to the observed data values. The mean model, which uses the mean for every predicted value, generally would be used if there were no informative predictor variables. The fit of a proposed regression model should therefore be better than the fit of the mean model. Three […]
Assumptions of Linear Models are about Residuals, not the Response Variable-I recently received a great question in a comment about whether the assumptions of normality, constant variance, and independence in linear models are about the residuals or the response variable.
The asker had a situation where Y, the response, was not normally distributed, but the residuals were.
Can a Regression Model with a Small R-squared Be Useful?-R² is such a lovely statistic, isn't it? Unlike so many of the others, it makes sense--the percentage of variance in Y accounted for by a model.
I mean, you can actually understand that. So can your grandmother. And the clinical audience you're writing the report for.
A big R² is always big (and good!) and a small one is always small (and bad!), right?
Can Likert Scale Data ever be Continuous?-A very common question is whether it is legitimate to use Likert scale data in parametric statistical procedures that require interval data, such as Linear Regression, ANOVA, and Factor Analysis. A typical Likert scale item has 5 to 11 points that indicate the degree of agreement with a statement, such as 1=Strongly Agree to 5=Strongly […]
Centering and Standardizing Predictors-I was recently asked about whether centering (subtracting the mean) a predictor variable in a regression model has the same effect as standardizing (converting it to a Z score). My response: They are similar but not the same. In centering, you are changing the values but not the scale. So a predictor that is centered […]
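A quick numerical check can make the distinction concrete. This is a small Python sketch with made-up predictor values: centering shifts the mean to 0 but leaves the standard deviation alone, while standardizing also rescales so the standard deviation becomes 1.

```python
from statistics import mean, stdev

# Hypothetical predictor values, made up for illustration.
x = [2.0, 4.0, 6.0, 8.0, 10.0]

x_mean = mean(x)
x_sd = stdev(x)

# Centering: subtract the mean. The units of X are unchanged.
centered = [v - x_mean for v in x]

# Standardizing: subtract the mean, then divide by the standard deviation.
# The result is in standard deviation units (a Z score).
standardized = [(v - x_mean) / x_sd for v in x]

print(mean(centered), stdev(centered))
print(mean(standardized), stdev(standardized))
```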
Centering for Multicollinearity Between Main effects and Quadratic terms-One of the most common causes of multicollinearity is when predictor variables are multiplied to create an interaction term or a quadratic or higher order terms (X squared, X cubed, etc.). Why does this happen? When all the X values are positive, higher values produce high products and lower values produce low products. So the […]
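The pattern is easy to check numerically. The Python sketch below uses made-up, all-positive X values and a hand-written Pearson correlation helper (both are assumptions for illustration). It compares the correlation between X and X squared before and after centering X at its mean.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    sxx = sum((a - x_bar) ** 2 for a in xs)
    syy = sum((b - y_bar) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical all-positive predictor values, made up for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

# Raw X against raw X squared: high values pair with high squares.
r_raw = pearson(x, [v ** 2 for v in x])

# Center X first, then square the centered values.
x_mean = mean(x)
x_centered = [v - x_mean for v in x]
r_centered = pearson(x_centered, [v ** 2 for v in x_centered])

print(r_raw)
print(r_centered)
```

In this example the raw correlation is close to 1, and the centered one is essentially zero because the made-up X values are symmetric around their mean. With skewed data, centering usually reduces the correlation without removing it entirely.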
Checking the Normality Assumption for an ANOVA Model-The assumptions are exactly the same for ANOVA and regression models. The normality assumption is that residuals follow a normal distribution. You usually see it like this:
ε~ i.i.d. N(0, σ²)
But what it's really getting at is the distribution of Y|X.
Clarifications on Interpreting Interactions in Regression-In a previous post, Interpreting Interactions in Regression, I said the following: In our example, once we add the interaction term, our model looks like: Height = 35 + 4.2*Bacteria + 9*Sun + 3.2*Bacteria*Sun Adding the interaction term changed the values of B1 and B2. The effect of Bacteria on Height is now 4.2 + […]
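The model equation quoted above can be evaluated directly to see how the effect of Bacteria depends on Sun. The Python sketch below only plugs numbers into that equation. Treating Sun as a 0/1 indicator is an assumption made here for illustration.

```python
def predicted_height(bacteria, sun):
    """Predicted Height from the quoted model, including the interaction."""
    return 35 + 4.2 * bacteria + 9 * sun + 3.2 * bacteria * sun


def bacteria_slope(sun):
    """Change in predicted Height for a one-unit increase in Bacteria,
    holding Sun fixed at the given value."""
    return predicted_height(1, sun) - predicted_height(0, sun)


# Assumption: Sun is coded 0 or 1.
slope_when_sun_0 = bacteria_slope(0)
slope_when_sun_1 = bacteria_slope(1)

print(round(slope_when_sun_0, 2))
print(round(slope_when_sun_1, 2))
```

The two slopes differ by the interaction coefficient. So once the interaction is in the model, the coefficient on Bacteria alone describes its effect only when Sun equals 0.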
Confusing Statistical Term #4: Hierarchical Regression vs. Hierarchical Model-This one is relatively simple. Very similar names for two totally different concepts. Hierarchical Models (aka Hierarchical Linear Models or HLM) are a type of linear regression model in which the observations fall into hierarchical, or completely nested, levels. Hierarchical Models are a type of Multilevel Model. So what is a hierarchical data structure, which […]
Confusing Statistical Term #7: GLM-Like some of the other terms in our list--level and beta--GLM has two different meanings.
It's a little different than the others, though, because it's an abbreviation for two different terms:
General Linear Model and Generalized Linear Model.
It's extra confusing because their names are so similar on top of having the same abbreviation.
Confusing Statistical Terms #1: The Many Names of Independent Variables-Statistical models, such as general linear models (linear regression, ANOVA, mixed models) and generalized linear models (logistic, Poisson, proportional hazard regression, etc.) all have the same general form. On the left side of the equation is one or more response variables, Y. On the right hand side is one or more predictor variables, X, and […]
Continuous and Categorical Variables: The Trouble with Median Splits-A Median Split is one method for turning a continuous variable into a categorical one. Essentially, the idea is to find the median of the continuous variable. Any value below the median is put in the category “Low” and every value above it is labeled “High.” This is a very common practice in many social […]
Dummy Coding in SPSS GLM–More on Fixed Factors, Covariates, and Reference Groups, Part 1-So the question is what to do with your categorical variables. You have two choices, and each has advantages and disadvantages.
The easiest is to put categorical variables in Fixed Factors. SPSS will dummy code those variables for you, which is quite convenient if your categorical variable has more than two categories. However, there are some defaults you need to be aware of that may or may not make this a good choice.
SPSS always makes the reference group the one that comes last alphabetically. So if the values you input are strings, it will be the one that comes last. If those values are numbers, it will be the highest one.
Dummy Coding in SPSS GLM–More on Fixed Factors, Covariates, and Reference Groups, Part 2-Yesterday’s post outlined one issue in deciding whether to put a categorical predictor variable into Fixed Factors or Covariates in SPSS GLM. That issue dealt with how SPSS automatically creates dummy variables out of any variable in Fixed Factors. Another default to keep in mind is that SPSS will automatically create interactions between any and […]
Five Common Relationships Among Three Variables in a Statistical Model-Including Z in the model often leads to the relationship between X and Y becoming more significant because Z has explained some of the otherwise unexplained variance in Y.
An example of this kind of covariate is when an experimental manipulation (X) on response time (Y) only becomes significant when we control for finger dexterity levels (Z).
GLM in SPSS: Centering a Covariate to Improve Interpretability-The reason for centering a continuous covariate is that it can improve interpretability. For example, say you had one categorical predictor with 4 categories and one continuous covariate, plus an interaction between them. First, you’ll notice that if you center your covariate at the mean, there is
Have you Wondered how using SPSS Burns Calories?-Number 4: This morning, I received an email listing some interesting facts, among them: "Banging your head against a wall burns 150 calories an hour." I'm pretty sure that one is not specifically about SPSS, but it could be.
Help me plan my spring statistics workshops-Can I ask you a favor? I am planning our spring statistics workshops. As always, we’re getting creative to find ways to bring you the statistical support you need easily and efficiently. I found a great service that will allow me to do workshops via webcast, so you can participate from home or office–no travel […]
How to Combine Complicated Models with Tricky Effects-You're dealing with both a complicated modeling technique (survival analysis, logistic regression, multilevel modeling) and tricky effects in the model (dummy coding, interactions, and quadratic terms).
The only way to figure it all out in a situation like that is to break it down into parts. Trying to understand all those complicated parts together is a recipe for disaster.
But if you can do linear regression, each part is just one step up in complexity. Take one step at a time.
Interpreting (Even Tricky) Regression Coefficients – A Quiz-Here’s a little quiz: True or False? 1. When you add an interaction to a regression model, you can still evaluate the main effects of the terms that make up the interaction, just like in ANOVA. 2. The intercept is usually meaningless in a regression model.
Interpreting Interactions Between Two Effect-Coded Categorical Predictors-I recently received this great question: Question: Hi Karen, ive purchased a lot of your material and read a lot of your pdf documents w.r.t. regression and interaction terms. Its, now, my general understanding that interaction for two or more categorical variables is best done with effects coding, and interactions cont v. categorical variables is […]
Interpreting Interactions in Regression-Adding interaction terms to a regression model can greatly expand understanding of the relationships among the variables in the model and allows more hypotheses to be tested. The example from Interpreting Regression Coefficients was a model of the height of a shrub (Height) based on the amount of bacteria in the soil (Bacteria) and whether […]
Interpreting Regression Coefficients-Linear regression is one of the most popular statistical techniques. Despite its popularity, interpretation of the regression coefficients of any but the simplest models is sometimes, well….difficult. So let’s interpret the coefficients of a continuous and a categorical variable. Although the example here is a linear regression model, the approach works for interpreting coefficients from […]
Interpreting Regression Coefficients in Models other than Ordinary Linear Regression-So this is the actual model for an ordinary least squares linear regression. The left hand side of the equation is just Y and ε, the error term, has a normal distribution.
For other types of regression models, like logistic regression, Poisson regression, or multilevel models, all the βs and Xs stay the same. The only parts that can differ:
1. Instead of Y on the left, there can be a function of Y--a non-linear transformation.
2. Instead of a normal distribution, the residuals can have another distribution.
Interpreting Regression Coefficients: Changing the scale of predictor variables-Sometimes it makes sense to change the scale of predictor variables so that interpretations of parameter estimates, including odds ratios, make sense. It is generally done by multiplying the values of a predictor by a constant, often a factor of 10. Since parameter estimates and odds ratios tell you the effect of a one unit […]
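A small numerical check shows what rescaling does to a coefficient. This Python sketch uses made-up data and a hand-written simple least squares helper (both are assumptions for illustration). It fits the same outcome on a predictor and on that predictor multiplied by 10.

```python
from statistics import mean


def simple_ols(x, y):
    """Return (intercept, slope) for a one-predictor least squares fit."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical data, made up for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Multiply the predictor by a constant (here 10).
x_times_10 = [v * 10 for v in x]

intercept_raw, slope_raw = simple_ols(x, y)
intercept_scaled, slope_scaled = simple_ols(x_times_10, y)

print(slope_raw)
print(slope_scaled)
```

The slope on the rescaled predictor is one tenth of the original, and the intercept is unchanged. The fit is identical; only the size of "one unit" has changed.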
Is Multicollinearity the Bogeyman?-Multicollinearity occurs when two or more predictor variables in a regression model are redundant. It is a real problem, and it can do terrible things to your results. But it is uncommon, and is often misdiagnosed.
July 2017 Member Webinar: The Multi-Faceted World of Residuals-Residuals can be a very broad topic - one that most everyone has heard of, but few people truly understand. It’s time to change that.
By definition, a “residual” is “the quantity remaining after other things have been subtracted or allowed for.” In statistics, we use the term in a similar fashion.
Residuals come in various forms:
But which ones do we use… and why?
June 2017 Member Webinar: Mediated Moderation and Moderated Mediation-Often a model is not a simple process from a treatment or intervention to the outcome. In essence, the value of X does not always directly predict the value of Y.
Mediators can affect the relationship between X and Y. Moderators can affect the scale and magnitude of that relationship. And sometimes the mediators and moderators affect each other.
Linear Regression Analysis – 3 Common Causes of Multicollinearity and What to Do About Them-Multicollinearity is simply redundancy in the information contained in predictor variables. If the redundancy is moderate, it usually only affects the interpretation of regression coefficients. But if it is severe, at or near perfect redundancy, it causes the model to "blow up." (And yes, that's a technical term). But the reality is that there are only five situations where it commonly occurs. And three of them have very simple solutions.
Making Dummy Codes Easy to Keep Track of-Here’s a little tip. When you construct Dummy Variables, make it easy on yourself to remember which code is which. Heck, if you want to be really nice, make it easy for anyone else who will analyze the data or read the results. Make the codes inherent in the Dummy variable name. So instead of […]
One-tailed and two-tailed tests-I was recently asked about when to use one and two tailed tests. The long answer is: Use one tailed tests when you have a specific hypothesis about the direction of your relationship. Some examples include you hypothesize that one group mean is larger than the other; you hypothesize that the correlation is positive; you […]
Outliers: To Drop or Not to Drop-Outliers are one of those statistical issues that everyone knows about, but most people aren’t sure how to deal with. Most parametric statistics, like means, standard deviations, and correlations, and every statistic based on these, are highly sensitive to outliers. And since the assumptions of common statistical procedures, like linear regression and ANOVA, are also […]
Poisson Regression Analysis for Count Data-There are many dependent variables that no matter how many transformations you try, you cannot get to be normally distributed. The most common culprits are count variables–the variable that measures the count or rate of some event in a sample. Some examples I’ve seen from a variety of disciplines are: Number of eggs in a clutch […]
Problems Caused by Categorizing Continuous Variables-I just came across this great article by Frank Harrell: Problems Caused by Categorizing Continuous Variables It’s from the Vanderbilt University biostatistics department, so the examples are all medical, but the points hold for any field. It goes right along with my recent post, Continuous and Categorical Variables: The Trouble with Median Splits.
Regression models without intercepts-A recent question on the Talkstats forum asked about dropping the intercept in a linear regression model since it makes the predictor’s coefficient stronger and more significant. Dropping the intercept in a regression model forces the regression line to go through the origin–the y intercept must be 0. The problem with dropping the intercept is […]
Regression Models: How do you know you need a polynomial?-A polynomial term–a quadratic (squared) or cubic (cubed) term–turns a linear regression model into a curve. But because it is X that is squared or cubed, not the Beta coefficient, it still qualifies as a linear model. This makes it a nice, straightforward way to model curves without having to model complicated non-linear models. […]
Regression Through the Origin-I just wanted to follow up on my last post about Regression without Intercepts. Regression through the Origin means that you purposely drop the intercept from the model. When X=0, Y must = 0. The thing to be careful about in choosing any regression model is that it fit the data well. Pretty much the […]
September 2017 Member Webinar: Quantile Regression: Going Beyond the Mean-Quantiles (the median, 25th percentile, etc.) are valuable statistical descriptors, but their usefulness doesn’t stop there.
In regression analysis, quantiles can also help answer a broader set of research questions than standard linear regression.
In standard linear regression, the focus is on predicting the mean of a response (or dependent) variable, given a set of predictor variables.
For example, standard linear regression can help us understand how age predicts the mean income of a study population.
Contrast this with quantile regression, which allows us to go beyond the mean of the response variable. Now we can understand how predictor variables predict the entire distribution of the response variable, or one or more relevant features (e.g., center, spread, shape) of this distribution.
Should You Always Center a Predictor on the Mean?-One problem is that the mean age at which infants utter their first word may differ from one sample to another. This means you're not always evaluating that mean at the exact same age. It's not comparable across samples.
So another option is to choose a meaningful value of age that is within the values in the data set. One example may be at 12 months.
SPSS GLM: Choosing Fixed Factors and Covariates-The beauty of the Univariate GLM procedure in SPSS is that it is so flexible. You can use it to analyze regressions, ANOVAs, ANCOVAs with all sorts of interactions, dummy coding, etc. The down side of this flexibility is it is often confusing what to put where and what it all means. So here’s a […]
The 13 Steps for Statistical Modeling in any Regression or ANOVA-No matter what statistical model you’re running, you need to go through the same 13 steps. The order and the specifics of how you do each step will differ depending on the data and the type of model you use. These 13 steps are in 3 major parts. Most people think of only Part 3 […]
The Difference Between Interaction and Association-Interaction is different. Whether two variables are associated says nothing about whether they interact in their effect on a third variable. Likewise, if two variables interact, they may or may not be associated.
The Distribution of Independent Variables in Regression Models-While there are a number of distributional assumptions in regression models, one distribution that has no assumptions is that of any predictor (i.e. independent) variables. It’s because regression models are directional. In a correlation, there is no direction–Y and X are interchangeable. If you switched them, you’d get the same correlation coefficient. But regression is […]
The Exposure Variable in Poisson Regression Models-Poisson Regression Models and their extensions (Zero-Inflated Poisson, Negative Binomial Regression, etc.) are used to model counts and rates. A few examples of count variables include: – Number of words an eighteen month old can say – Number of aggressive incidents performed by patients in an inpatient rehab center Most count variables follow one of […]
The Impact of Removing the Constant from a Regression Model: The Categorical Case-In a simple linear regression model how the constant (aka, intercept) is interpreted depends upon the type of predictor (independent) variable.
If the predictor is categorical and dummy-coded, the constant is the mean value of the outcome variable for the reference category only. If the predictor variable is continuous, the constant equals the predicted value of the outcome variable when the predictor variable equals zero.
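The categorical case is easy to verify with a tiny example. The Python sketch below uses made-up outcome values for two groups and a hand-written simple least squares helper (both are assumptions for illustration). It checks that the fitted constant matches the mean of the reference category.

```python
from statistics import mean


def simple_ols(x, y):
    """Return (intercept, slope) for a one-predictor least squares fit."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical data: group is dummy coded, 0 = reference category.
group = [0, 0, 0, 1, 1, 1]
outcome = [10.0, 12.0, 11.0, 15.0, 17.0, 16.0]

intercept, slope = simple_ols(group, outcome)

# Group means computed directly from the data.
reference_mean = mean([o for g, o in zip(group, outcome) if g == 0])
other_mean = mean([o for g, o in zip(group, outcome) if g == 1])

print(intercept, reference_mean)
print(slope, other_mean - reference_mean)
```

The constant equals the reference group's mean, and the dummy's coefficient equals the difference between the two group means.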
Understanding Interactions Between Categorical and Continuous Variables in Linear Regression-So we’ve looked at the interaction effect between two categorical variables. But let’s make things a little more interesting, shall we? What if our predictors of interest, say, are a categorical and a continuous variable? How do we interpret the interaction between the two? We’ll keep working with our trusty 2014 General Social Survey data set. But this time let’s examine the impact of job prestige level (a continuous variable) and gender (a categorical, dummy coded variable) as our two predictors.
Using Adjusted Means to Interpret Moderators in Analysis of Covariance-If you use the menus in SPSS, you can only get those EMMeans at the Covariate's mean, which in this example is about 25, where the vertical black line is. This isn't very useful for our purposes. But we can change the value of the covariate at which to compare the means using syntax.
When Assumptions of ANCOVA are Irrelevant-Every once in a while, I work with a client who is stuck between a particular statistical rock and hard place. It happens when they're trying to run an analysis of covariance (ANCOVA) model because they have a categorical independent variable and a continuous covariate.
The problem arises when a coauthor, committee member, or reviewer insists that ANCOVA is inappropriate in this situation because one of the following ANCOVA assumptions is not met: (1) The independent variable and the covariate are independent of each other (2) There is no interaction between the independent variable and the covariate.
When Dependent Variables Are Not Fit for Linear Models, Now What?-When your dependent variable is not continuous, unbounded, and measured on an interval or ratio scale, your model will never meet the Assumptions of the General Linear Model (GLM). Today I’m going to go into more detail about these 6 common types of dependent variables, and the tests that work instead. Categorical Variables, including both […]
When Dummy Codes are Backwards, Your Stat Software may be Messing With You-In SAS proc glm, when you specify a predictor as categorical in the CLASS statement, it will automatically dummy code it for you in the parameter estimates table (the regression coefficients). The default reference category--what GLM will code as 0--is the highest value. This works just fine if your values are coded 1, 2, and 3. But if you've dummy coded them already, it's switching them on you.
When to Check Model Assumptions-If any of these fail, it’s nearly impossible to get normally distributed residuals, even with remedial transformations.
Types of variables that will generally fail these criteria include:
Categorical Variables, both nominal and ordinal.
Count Variables, which are often distributed as Poisson or Negative Binomial.
Why ANOVA is Really a Linear Regression, Despite the Difference in Notation-When I was in graduate school, stat professors would say “ANOVA is just a special case of linear regression.” But they never explained why. And I couldn’t figure it out. The model notation is different. The output looks different. The vocabulary is different. The focus of what we’re testing is completely different. How can they […]