The Analysis Factor

Statistical Consulting, Resources, and Statistics Workshops for Researchers



Logistic Regression Models for Multinomial and Ordinal Variables

by Karen Grace-Martin

Multinomial Logistic Regression

The multinomial (a.k.a. polytomous) logistic regression model is a simple extension of the binomial logistic regression model.  It is used when the dependent variable has more than two nominal (unordered) categories.

Dummy coding of independent variables is quite common.  In multinomial logistic regression, the dependent variable is also dummy coded into multiple 1/0 variables: if there are M categories, there will be M-1 dummy variables, each equal to 1 for its own category and 0 for all others.  The one remaining category, the reference category, doesn’t need its own dummy variable, since it is uniquely identified by all the other dummy variables being 0.

The multinomial logistic regression then estimates a separate binary logistic regression model for each of those dummy variables.  The result is [Read more…] about Logistic Regression Models for Multinomial and Ordinal Variables
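In SPSS, for example, the NOMREG procedure fits this model and handles the dummy coding of the outcome internally.  Here’s a minimal sketch, assuming a nominal outcome with one categorical and one continuous predictor (all variable names here are hypothetical):

* Multinomial logistic regression; the last category of outcome is the reference.
NOMREG outcome (BASE=LAST) BY group WITH age
  /PRINT=PARAMETER SUMMARY.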

Tagged With: Binary Logistic Regression, dummy variable, Multinomial Logistic Regression, Ordinal Logistic Regression, Proportional Odds Model

Related Posts

  • How to Decide Between Multinomial and Ordinal Logistic Regression Models
  • Opposite Results in Ordinal Logistic Regression, Part 2
  • Opposite Results in Ordinal Logistic Regression—Solving a Statistical Mystery
  • Confusing Statistical Terms #1: The Many Names of Independent Variables

Variable Labels and Value Labels in SPSS

by Karen Grace-Martin

SPSS Variable Labels and Value Labels are two great features for creating a code book right in the data set.  Using them every time is good data analysis practice.

SPSS doesn’t limit variable names to 8 characters like it used to, but you still can’t use spaces, and it will make coding easier if you keep the variable names short.  You then use Variable Labels to give a nice, long description of each variable.  On questionnaires, I often use the actual question.

There are good reasons for using Variable Labels right in the data set.  I know you want to get right to your data analysis, but using Variable Labels will save so much time later.

1. If your paper code sheet ever gets lost, you still have the variable names.

2. Anyone else who uses your data–lab assistants, graduate students, statisticians–will immediately know what each variable means.

3. As entrenched as you are with your data right now, you will forget what those variable names refer to within months.  When a committee member or reviewer wants you to redo an analysis, it will save tons of time to have those variable labels right there.

4.  It’s just more efficient–you don’t have to look up what those variable names mean when you read your output.

Variable Labels

The really nice part is SPSS makes Variable Labels easy to use:

1. Mouse over the variable name in the Data View spreadsheet to see the Variable Label.

2. In dialog boxes, lists of variables can be shown with either Variable Names or Variable Labels.  Just go to Edit–>Options.  In the General tab, choose Display Labels.

3. On the output, SPSS allows you to print out Variable Names or Variable Labels or both.  I usually like to have both.  Just go to Edit–>Options.  In the Output tab, choose ‘Names and Labels’ in the first and third boxes.
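In syntax, assigning Variable Labels takes one line per variable.  A minimal sketch, with made-up questionnaire variables:

VARIABLE LABELS
  q1 "How satisfied are you with your current position?"
  /q2 "How likely are you to recommend this program to a colleague?".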

Value Labels

Value Labels are similar, but they are descriptions of the values a variable can take.  Labeling values right in SPSS means you don’t have to remember whether 1=Strongly Agree and 5=Strongly Disagree or vice versa.  And it makes data entry much more efficient: you can type 1 and 0 for Male and Female much faster than you can type out the whole words, or even M and F.  By having Value Labels, your data and output still show you the meaningful values.

Once again, SPSS makes it easy for you.

1. If you’d rather see Male and Female in the data set than 0 and 1, go to View–>Value Labels.

2. Like Variable Labels, you can get Value Labels on output, along with the actual values.  Just go to Edit–>Options.  In the ‘Output Labels’ tab, choose ‘Values and Labels’ in the second and fourth boxes.
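The syntax version is just as simple.  A sketch with hypothetical variable names (agree1 to agree5 are assumed to be adjacent in the file):

VALUE LABELS gender 0 "Male" 1 "Female"
  /agree1 TO agree5 1 "Strongly Agree" 2 "Agree" 3 "Neutral" 4 "Disagree" 5 "Strongly Disagree".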


Tagged With: SPSS, Value Labels, variable labels

Related Posts

  • How to Get a Code Book from SPSS
  • Tricks for Using Word to Make Statistical Syntax Easier
  • 3 Pieces of SPSS Syntax to Keep Handy
  • Averaging and Adding Variables with Missing Data in SPSS

SPSS GLM: Choosing Fixed Factors and Covariates

by Karen Grace-Martin

The beauty of the Univariate GLM procedure in SPSS is that it is so flexible.  You can use it to analyze regressions, ANOVAs, ANCOVAs with all sorts of interactions, dummy coding, etc.

The downside of this flexibility is that it’s often confusing to figure out what goes where and what it all means.

So here’s a quick breakdown.

The dependent variable, I hope, is pretty straightforward: put in your continuous dependent variable.

Fixed Factors are categorical independent variables.  It does not matter if the variable is [Read more…] about SPSS GLM: Choosing Fixed Factors and Covariates
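As a sketch of the corresponding syntax, assuming a continuous outcome, a categorical treatment as the fixed factor, and age as the covariate (all hypothetical names):

* Univariate GLM with one fixed factor and one covariate.
UNIANOVA outcome BY treatment WITH age
  /PRINT=PARAMETER
  /DESIGN=treatment age.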

Tagged With: analysis of covariance, ancova, ANOVA, Covariate, dummy coding, Fixed Factor, linear regression, post hoc test, SPSS GLM

Related Posts

  • Dummy Coding in SPSS GLM–More on Fixed Factors, Covariates, and Reference Groups, Part 1
  • The General Linear Model, Analysis of Covariance, and How ANOVA and Linear Regression Really are the Same Model Wearing Different Clothes
  • Dummy Coding in SPSS GLM–More on Fixed Factors, Covariates, and Reference Groups, Part 2
  • Why ANOVA and Linear Regression are the Same Analysis

Confusing Statistical Terms #1: The Many Names of Independent Variables

by Karen Grace-Martin

Statistical models, such as general linear models (linear regression, ANOVA, MANOVA), linear mixed models, and generalized linear models (logistic regression, Poisson regression, etc.), all have the same general form.

On the left side of the equation are one or more response variables, Y. On the right-hand side are one or more predictor variables, X, and their coefficients, B. The variables on the right-hand side can take many forms and are called by many names.

There are subtle distinctions in the meanings of these names. Unfortunately, though, there are two practices that make them more confusing than they need to be.

First, the terms are often used interchangeably. One person may use “predictor variable” and “independent variable” interchangeably while another may not, so the listener may read in subtle distinctions that the speaker never intended.

Second, the same terms are used differently in different fields or research situations. So if you are an epidemiologist who does research on mostly observed variables, you probably have been trained with slightly different meanings to some of these terms than if you’re a psychologist who does experimental research.

Even worse, statistical software packages use different names for similar concepts, even among their own procedures. This quest for precision often just creates more confusion. (It’s hard enough without switching the words!)

Here are some common terms that all refer to a variable in a model that is proposed to affect or predict another variable.

I’ll give you the different definitions and implications, but it’s very likely that I’m missing some. If you see a term that means something different than you understand it, please add it to the comments. And please tell us which field you primarily work in.

Predictor Variable, Predictor

This is the most generic of the terms. It carries no implication that the variable is manipulated, observed, categorical, or numerical, and it does not imply causality.

A predictor variable is simply used for explaining or predicting the value of the response variable. The term is used predominantly in regression.

Independent Variable

I’ve seen Independent Variable (IV) used different ways.

1. It implies causality: the independent variable affects the dependent variable. This usage is predominant in ANOVA models where the Independent Variable is manipulated by the experimenter. If it is manipulated, it’s generally categorical and subjects are randomly assigned to conditions.

2. It does not imply causality, but it is a key predictor variable for answering the research question. It is in the model because the researcher is interested in understanding its relationship with the dependent variable; in other words, it’s not a control variable.

3. It does not imply causality or the importance of the variable to the research question. But it is uncorrelated with (independent of) all other predictors.

Honestly, I only recently saw someone define the term Independent Variable this way. Under this definition, predictor variables cannot be independent variables if they are at all correlated. It surprised me, but it’s good to know that some people mean this when they use the term.

Explanatory Variable

A predictor variable in a model where the main point is not to predict the response variable, but to explain a relationship between X and Y.

Control Variable

A predictor variable that could be related to or affecting the dependent variable, but not really of interest to the research question.

Covariate

Generally a continuous predictor variable, used in both ANCOVA (analysis of covariance) and regression. Some people use this term to refer to all predictor variables in regression, but it really means continuous predictors. Adding a covariate to ANOVA (analysis of variance) is what turns it into ANCOVA.

Sometimes covariate implies that the variable is a control variable (as opposed to an independent variable), but not always.

And sometimes people use covariate to mean control variable, either numerical or categorical.

This one is so confusing it got its own Confusing Statistical Terms article.

Confounding Variable, Confounder

These terms are used differently in different fields. In experimental design, it’s used to mean a variable whose effect cannot be distinguished from the effect of an independent variable.

In observational fields, it’s used to mean one of two situations. The first is a variable that is so correlated with an independent variable that it’s difficult to separate out their effects on the response variable. The second is a variable that causes the independent variable’s effect on the response.

The distinction between those interpretations is slight but important.

Exposure Variable

This is a term for independent variable in some fields, particularly epidemiology. It’s the key predictor variable.

Risk Factor

Another epidemiology term for a predictor variable. Unlike the term “Factor” listed below, it does not imply a categorical variable.

Factor

A categorical predictor variable. It may or may not indicate a cause/effect relationship with the response variable (this depends on the study design, not the analysis).

Independent variables in ANOVA are almost always called factors. In regression, they are often referred to as indicator variables, categorical predictors, or dummy variables. They are all the same thing in this context.

Also, please note that Factor has completely other meanings in statistics, so it too got its own Confusing Statistical Terms article.

Feature

Used in Machine Learning and Predictive models, this is simply a predictor variable.

Grouping Variable

Same as a factor.

Fixed factor

A categorical predictor variable in which the specific values of the categories are intentional and important, often chosen by the experimenter. Examples include experimental treatments or demographic categories, such as sex and race.

If you’re not doing a mixed model (and you should know if you are), all your factors are fixed factors. For a more thorough explanation of fixed and random factors, see Specifying Fixed and Random Factors in Mixed or Multi-Level Models.

Random factor

A categorical predictor variable in which the specific values of the categories represent a random sample from a larger population of possible values. Generally used in mixed modeling. Examples include subjects or random blocks.

For a more thorough explanation of fixed and random factors, see Specifying Fixed and Random Factors in Mixed or Multi-Level Models.

Blocking variable

This term is generally used in experimental design, but I’ve also seen it in randomized controlled trials.

A blocking variable is a variable that indicates an experimental block: a cluster or experimental unit that restricts complete randomization and that often results in similar response values among members of the block.

Blocking variables can be either fixed or random factors. They are never continuous.

Dummy variable

A categorical variable that has been dummy coded. Dummy coding (also called indicator coding) is usually used in regression models, but not ANOVA. A dummy variable can have only two values: 0 and 1. When a categorical variable has more than two values, it is recoded into multiple dummy variables.
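A minimal SPSS sketch of dummy coding, assuming a hypothetical 3-category variable region coded 1, 2, 3, with category 3 as the reference:

* Two dummy variables identify the three categories; region = 3 is the reference.
RECODE region (1=1)(2=0)(3=0) INTO region1.
RECODE region (2=1)(1=0)(3=0) INTO region2.
EXECUTE.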

Indicator variable

Same as dummy variable.

The Take Away Message

Whenever you’re using technical terms in a report, an article, or a conversation, it’s always a good idea to define your terms. This is especially important in statistics, which is used in many, many fields, each of which adds its own subtleties to the terminology.


Confusing Statistical Terms Series

Confusing Statistical Terms #1: The Many Names of Independent Variables

Confusing Statistical Terms #2: Alpha and Beta

Confusing Statistical Terms #3: Levels

Confusing Statistical Term #4: Hierarchical Regression vs. Hierarchical Model

Confusing Statistical Term #5: Covariate

Confusing Statistical Term #6: Factor

Confusing Statistical Term #7: GLM

Tagged With: ANOVA, Covariate, dummy variable, explanatory variable, Fixed Factor, independent variable, predictor variable, Random Factor

Related Posts

  • SPSS GLM: Choosing Fixed Factors and Covariates
  • Dummy Coding in SPSS GLM–More on Fixed Factors, Covariates, and Reference Groups, Part 1
  • Same Statistical Models, Different (and Confusing) Output Terms
  • What’s in a Name? Moderation and Interaction, Independent and Predictor Variables

Regression Models for Count Data

by Karen Grace-Martin

One of the main assumptions of linear models such as linear regression and analysis of variance is that the residual errors follow a normal distribution. To meet this assumption when a continuous response variable is skewed, a transformation of the response variable can produce errors that are approximately normal. Often, however, the response variable of interest is categorical or discrete, not continuous. In this case, a simple transformation cannot produce normally distributed errors.

A common example is when the response variable is the counted number of occurrences of an event. The distribution of counts is discrete, not continuous, and is limited to non-negative values. There are two problems with applying an ordinary linear regression model to these data. First, many distributions of count data are positively skewed with many observations in the data set having a value of 0. The high number of 0’s in the data set prevents the transformation of a skewed distribution into a normal one. Second, it is quite likely that the regression model will produce negative predicted values, which are theoretically impossible.

An example of a regression model with a count response variable is the prediction of the number of times a person perpetrated domestic violence against his or her partner in the last year based on whether he or she had witnessed domestic violence as a child and who the perpetrator of that violence was. Because many individuals in the sample had not perpetrated violence at all, many observations had a value of 0, and any attempts to transform the data to a normal distribution failed.

An alternative is to use a Poisson regression model or one of its variants. These models have a number of advantages over an ordinary linear regression model, including allowing for a skewed, discrete distribution and restricting predicted values to non-negative numbers. A Poisson model is similar to an ordinary linear regression, with two exceptions. First, it assumes that the response, conditional on the predictors, follows a Poisson, not a normal, distribution. Second, rather than modeling Y itself as a linear function of the regression coefficients, it models the natural log of the response’s expected value, ln(E(Y)), as a linear function of the coefficients.

The Poisson model assumes that the mean and variance of the errors are equal. But usually in practice the variance is larger than the mean (although it can also be smaller). When the variance is larger than the mean, there are two extensions of the Poisson model that work well. In the over-dispersed Poisson model, an extra parameter is included that estimates how much larger the variance is than the mean. This parameter estimate is then used to correct the p-values for the effects of the larger variance. An alternative is a negative binomial model. The negative binomial distribution is a form of the Poisson distribution in which the distribution’s parameter is itself treated as a random variable. The variation in this parameter can account for variance in the data that is higher than the mean.

A negative binomial model proved to fit well for the domestic violence data described above. Because the majority of individuals in the data set perpetrated 0 times, but a few individuals perpetrated many times, the variance was over 6 times larger than the mean. Therefore, the negative binomial model was clearly more appropriate than the Poisson.

All three variations of the Poisson regression model are available in many general statistical packages, including SAS, Stata, and S-Plus.
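They are also available in SPSS through the GENLIN procedure. A minimal sketch with hypothetical variable names (a count outcome, one categorical and one continuous predictor):

* Standard Poisson regression.
GENLIN count BY group WITH age
  /MODEL group age DISTRIBUTION=POISSON LINK=LOG.

* Over-dispersed Poisson: standard errors scaled by the Pearson chi-square.
GENLIN count BY group WITH age
  /MODEL group age DISTRIBUTION=POISSON LINK=LOG
  /CRITERIA SCALE=PEARSON.

* Negative binomial, with the dispersion parameter estimated by maximum likelihood.
GENLIN count BY group WITH age
  /MODEL group age DISTRIBUTION=NEGBIN(MLE) LINK=LOG.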

References:

  • Gardner, W., Mulvey, E.P., and Shaw, E.C. (1995). “Regression Analyses of Counts and Rates: Poisson, Overdispersed Poisson, and Negative Binomial Models.” Psychological Bulletin, 118, 392-404.
  • Long, J.S. (1997). Regression Models for Categorical and Limited Dependent Variables, Chapter 8. Thousand Oaks, CA: Sage Publications.

Tagged With: Count data, count models, Negative Binomial Regression, Poisson Regression

Related Posts

  • The Exposure Variable in Poisson Regression Models
  • A Few Resources on Zero-Inflated Poisson Models
  • Poisson Regression Analysis for Count Data
  • The Importance of Including an Exposure Variable in Count Models

Averaging and Adding Variables with Missing Data in SPSS

by Karen Grace-Martin

SPSS has a nice little feature for adding and averaging variables with missing data that many people don’t know about.

It allows you to add or average variables, while specifying how many are allowed to be missing.

For example, a very common situation is that a researcher needs to average the values of the 5 variables that make up a scale, each of which is measured on the same Likert scale.

There are two ways to do this in SPSS syntax.

COMPUTE Newvar=(X1 + X2 + X3 + X4 + X5)/5.

or

COMPUTE Newvar=MEAN(X1, X2, X3, X4, X5).

In the first method, if any of the variables is missing, Newvar will also be missing: by default, SPSS returns a missing value whenever any variable in an arithmetic expression is missing.

In the second method, if any of the variables is missing, SPSS will still calculate the mean of the variables that are present.  While this seems great at first, the researcher may wish to limit how many of the 5 variables need to be observed in order to calculate the mean.  If only one or two variables are present, the mean may not be a reasonable estimate of the mean of all 5 variables.

SPSS has an option for dealing with this situation.  Running it the following way will only calculate the mean if at least 4 of the 5 variables are observed.  If fewer than 4 of the variables are observed, Newvar will be system missing.

COMPUTE Newvar=MEAN.4(X1, X2, X3, X4, X5).

You can specify any number of variables that need to be observed.

(The same distinction holds for the SUM function in SPSS, but with SUM the scale of the result changes based on how many variables are non-missing.  A better approach is to calculate the mean, then multiply by 5.)
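A sketch of that approach, reusing the variables above (Newsum is just an illustrative name): compute the mean requiring at least 4 valid values, then multiply by the number of items to put the result back on the sum’s scale.

COMPUTE Newsum=MEAN.4(X1, X2, X3, X4, X5)*5.
EXECUTE.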

This works the same way in the syntax or in the Transform–>Compute menu dialog.

First Published  12/1/2016;
Updated  7/20/21 to give more detail.

Tagged With: listwise deletion, Missing Data, SPSS, spss syntax

Related Posts

  • SPSS Syntax 101
  • Multiple Imputation in a Nutshell
  • Tricks for Using Word to Make Statistical Syntax Easier
  • Two Recommended Solutions for Missing Data: Multiple Imputation and Maximum Likelihood
