Regression models

Centering and Standardizing Predictors

December 5th, 2008 by

I was recently asked about whether centering (subtracting the mean) a predictor variable in a regression model has the same effect as standardizing (converting it to a Z score).  My response:

They are similar but not the same.

In centering, you are changing the values but not the scale.  So a predictor that is centered at the mean has new values–the entire scale has shifted so that the mean now has a value of 0, but one unit is still one unit.  The intercept will change, but the regression coefficient for that variable will not.  Since the regression coefficient is interpreted as the effect on the mean of Y for each one unit difference in X, it doesn’t change when X is centered.

And incidentally, despite the name, you don’t have to center at the mean.  It is often convenient, but there can be advantages of choosing a more meaningful value that is also toward the center of the scale.

But a Z-score also changes the scale.  A one-unit difference now means a one-standard deviation difference.  You will interpret the coefficient differently.  This is usually done so you can compare coefficients for predictors that were measured on different scales.  I can’t think of an advantage for doing this for an interaction.

 


Confusing Statistical Terms #1: The Many Names of Independent Variables

November 24th, 2008 by

Statistical models, such as general linear models (linear regression, ANOVA, MANOVA), linear mixed models, and generalized linear models (logistic, Poisson, regression, etc.) all have the same general form.

On the left side of the equation is one or more response variables, Y. On the right hand side is one or more predictor variables, X, and their coefficients, B. The variables on the right hand side can have many forms and are called by many names.

There are subtle distinctions in the meanings of these names. Unfortunately, though, there are two practices that make them more confusing than they need to be.

First, they are often used interchangeably. So someone may use “predictor variable” and “independent variable” interchangably and another person may not. So the listener may be reading into the subtle distinctions that the speaker may not be implying.

Second, the same terms are used differently in different fields or research situations. So if you are an epidemiologist who does research on mostly observed variables, you probably have been trained with slightly different meanings to some of these terms than if you’re a psychologist who does experimental research.

Even worse, statistical software packages use different names for similar concepts, even among their own procedures. This quest for accuracy often renders confusion. (It’s hard enough without switching the words!).

Here are some common terms that all refer to a variable in a model that is proposed to affect or predict another variable.

I’ll give you the different definitions and implications, but it’s very likely that I’m missing some. If you see a term that means something different than you understand it, please add it to the comments. And please tell us which field you primarily work in.

Predictor Variable, Predictor

This is the most generic of the terms. There are no implications for being manipulated, observed, categorical, or numerical. It does not imply causality.

A predictor variable is simply used for explaining or predicting the value of the response variable. Used predominantly in regression.

Independent Variable

I’ve seen Independent Variable (IV) used different ways.

1. It implies causality: the independent variable affects the dependent variable. This usage is predominant in ANOVA models where the Independent Variable is manipulated by the experimenter. If it is manipulated, it’s generally categorical and subjects are randomly assigned to conditions.

2. It does not imply causality, but it is a key predictor variable for answering the research question. In other words, it is in the model because the researcher is interested in understanding its relationship with the dependent variable. In other words, it’s not a control variable.

3. It does not imply causality or the importance of the variable to the research question. But it is uncorrelated (independent) of all other predictors.

Honestly, I only recently saw someone define the term Independent Variable this way. Predictor Variables cannot be independent variables if they are at all correlated. It surprised me, but it’s good to know that some people mean this when they use the term.

Explanatory Variable

A predictor variable in a model where the main point is not to predict the response variable, but to explain a relationship between X and Y.

Control Variable

A predictor variable that could be related to or affecting the dependent variable, but not really of interest to the research question.

Covariate

Generally a continuous predictor variable. Used in both ANCOVA (analysis of covariance) and regression. Some people use this to refer to all predictor variables in regression, but it really means continuous predictors. Adding a covariate to ANOVA (analysis of variance) turns it into ANCOVA (analysis of covariance).

Sometimes covariate implies that the variable is a control variable (as opposed to an independent variable), but not always.

And sometimes people use covariate to mean control variable, either numerical or categorical.

This one is so confusing it got it’s own Confusing Statistical Terms article.

Confounding Variable, Confounder

These terms are used differently in different fields. In experimental design, it’s used to mean a variable whose effect cannot be distinguished from the effect of an independent variable.

In observational fields, it’s used to mean one of two situations. The first is a variable that is so correlated with an independent variable that it’s difficult to separate out their effects on the response variable. The second is a variable that causes the independent variable’s effect on the response.

The distinction in those interpretations are slight but important.

Exposure Variable

This is a term for independent variable in some fields, particularly epidemiology. It’s the key predictor variable.

Risk Factor

Another epidemiology term for a predictor variable. Unlike the term “Factor” listed below, it does not imply a categorical variable.

Factor

A categorical predictor variable. It may or may not indicate a cause/effect relationship with the response variable (this depends on the study design, not the analysis).

Independent variables in ANOVA are almost always called factors. In regression, they are often referred to as indicator variables, categorical predictors, or dummy variables. They are all the same thing in this context.

Also, please note that Factor has completely other meanings in statistics, so it too got its own Confusing Statistical Terms article.

Feature

Used in Machine Learning and Predictive models, this is simply a predictor variable.

Grouping Variable

Same as a factor.

Fixed factor

A categorical predictor variable in which the specific values of the categories are intentional and important, often chosen by the experimenter. Examples include experimental treatments or demographic categories, such as sex and race.

If you’re not doing a mixed model (and you should know if you are), all your factors are fixed factors. For a more thorough explanation of fixed and random factors, see Specifying Fixed and Random Factors in Mixed or Multi-Level Models

Random factor

A categorical predictor variable in which the specific values of the categories were randomly assigned. Generally used in mixed modeling. Examples include subjects or random blocks.

For a more thorough explanation of fixed and random factors, see Specifying Fixed and Random Factors in Mixed or Multi-Level Models

Blocking variable

This term is generally used in experimental design, but I’ve also seen it in randomized controlled trials.

A blocking variable is a variable that indicates an experimental block: a cluster or experimental unit that restricts complete randomization and that often results in similar response values among members of the block.

Blocking variables can be either fixed or random factors. They are never continuous.

Dummy variable

A categorical variable that has been dummy coded. Dummy coding (also called indicator coding) is usually used in regression models, but not ANOVA. A dummy variable can have only two values: 0 and 1. When a categorical variable has more than two values, it is recoded into multiple dummy variables.

Indicator variable

Same as dummy variable.

The Take Away Message

Whenever you’re using technical terms in a report, an article, or a conversation, it’s always a good idea to define your terms. This is especially important in statistics, which is used in many, many fields, each of whom adds their own subtleties to the terminology.

 

Confusing Statistical Terms Series

Confusing Statistical Terms #1: The Many Names of Independent Variables

Confusing Statistical Terms #2: Alpha and Beta

Confusing Statistical Terms #3: Levels

Confusing Statistical Term #4: Hierarchical Regression vs. Hierarchical Model

Confusing Statistical Term #5: Covariate

Confusing Statistical Term #6: Factor

Confusing Statistical Term #7: GLM

 


One-tailed and Two-tailed Tests

November 19th, 2008 by

I was recently asked about when to use one and two tailed tests.

The long answer is:  Use one tailed tests when you have a specific hypothesis about the direction of your relationship.  Some examples include you hypothesize that one group mean is larger than the other; you hypothesize that the correlation is positive; you hypothesize that the proportion is below .5.

The short answer is: Never use one tailed tests.

Why?

1. Only a few statistical tests even can have one tail: z tests and t tests.  So you’re severely limited.  F tests, Chi-square tests, etc. can’t accommodate one-tailed tests because their distributions are not symmetric.  Most statistical methods, such as regression and ANOVA, are based on these tests, so you will rarely have the chance to implement them.

2. Probably because they are rare, reviewers balk at one-tailed tests.  They tend to assume that you are trying to artificially boost the power of your test.  Theoretically, however, there is nothing wrong with them when the hypothesis and the statistical test are right for them.

 


Regression Through the Origin

November 13th, 2008 by

I just wanted to follow up on my last post about Regression without Intercepts.Stage 2

Regression through the Origin means that you purposely drop the intercept from the model.  When X=0, Y must = 0.

The thing to be careful about in choosing any regression model is that it fit the data well.  Pretty much the only time that a regression through the origin will fit better than a model with an intercept is if the point X=0, Y=0 is required by the data.

Yes, leaving out the intercept will increase your df by 1, since you’re not estimating one parameter.  But unless your sample size is really, really small, it won’t matter.  So it really has no advantages.

 


Regression Models for Count Data

October 24th, 2008 by

One of the main assumptions of linear models such as linear regression and analysis of variance is that the residual errors follow a normal distribution. To meet this assumption when a continuous response variable is skewed, a transformation of the response variable can produce errors that are approximately normal. Often, however, the response variable of interest is categorical or discrete, not continuous. In this case, a simple transformation cannot produce normally distributed errors.

A common example is when the response variable is the counted number of occurrences of an event. The distribution of counts is discrete, not continuous, and is limited to non-negative values. There are two problems with applying an ordinary linear regression model to these data. First, many distributions of count data are positively skewed with many observations in the data set having a value of 0. The high number of 0’s in the data set prevents the transformation of a skewed distribution into a normal one. Second, it is quite likely that the regression model will produce negative predicted values, which are theoretically impossible.

An example of a regression model with a count response variable is the prediction of the number of times a person perpetrated domestic violence against his or her partner in the last year based on whether he or she had witnessed domestic violence as a child and who the perpetrator of that violence was. Because many individuals in the sample had not perpetrated violence at all, many observations had a value of 0, and any attempts to transform the data to a normal distribution failed.

An alternative is to use a Poisson regression model or one of its variants. These models have a number of advantages over an ordinary linear regression model, including a skew, discrete distribution, and the restriction of predicted values to non-negative numbers. A Poisson model is similar to an ordinary linear regression, with two exceptions. First, it assumes that the errors follow a Poisson, not a normal, distribution. Second, rather than modeling Y as a linear function of the regression coefficients, it models the natural log of the response variable, ln(Y), as a linear function of the coefficients.

The Poisson model assumes that the mean and variance of the errors are equal. But usually in practice the variance of the errors is larger than the mean (although it can also be smaller). When the variance is larger than the mean, there are two extensions of the Poisson model that work well. In the over-dispersed Poisson model, an extra parameter is included which estimates how much larger the variance is than the mean. This parameter estimate is then used to correct for the effects of the larger variance on the p-values. An alternative is a negative binomial model. The negative binomial distribution is a form of the Poisson distribution in which the distribution’s parameter is itself considered a random variable. The variation of this parameter can account for a variance of the data that is higher than the mean.

A negative binomial model proved to fit well for the domestic violence data described above. Because the majority of individuals in the data set perpetrated 0 times, but a few individuals perpetrated many times, the variance was over 6 times larger than the mean. Therefore, the negative binomial model was clearly more appropriate than the Poisson.

All three variations of the Poisson regression model are available in many general statistical packages, including SAS, Stata, and S-Plus.

References:

 


Introduction to Logistic Regression

September 26th, 2008 by

Researchers are often interested in setting up a model to analyze the relationship between some predictors (i.e., independent variables) and a response (i.e., dependent variable). Linear regression is commonly used when the response variable is continuous.  One assumption of linear models is that the residual errors follow a normal distribution. This assumption fails when the response variable is categorical, so an ordinary linear model is not appropriate. This article presents a regression model for a response variable that is dichotomous–having two categories. Examples are common: whether a plant lives or dies, whether a survey respondent agrees or disagrees with a statement, or whether an at-risk child graduates or drops out from high school.

In ordinary linear regression, the response variable (Y) is a linear function of the coefficients (B0, B1, etc.) that correspond to the predictor variables (X1, X2, etc.). A typical model would look like:

Y = B0 + B1*X1 + B2*X2 + B3*X3 + … + E

For a dichotomous response variable, we could set up a similar linear model to predict individuals’ category memberships if numerical values are used to represent the two categories. Arbitrary values of 1 and 0 are chosen for mathematical convenience. Using the first example, we would assign Y = 1 if a plant lives and Y = 0 if a plant dies.

This linear model does not work well for a few reasons. First, the response values, 0 and 1, are arbitrary, so modeling the actual values of Y is not exactly of interest. Second, it is really the probability that each individual in the population responds with 0 or 1 that we are interested in modeling. For example, we may find that plants with a high level of a fungal infection (X1) fall into the category “the plant lives” (Y) less often than those plants with low level of infection. Thus, as the level of infection rises, the probability of a plant living decreases.

Thus, we might consider modeling P, the probability, as the response variable. Again, there are problems. Although the general decrease in probability is accompanied by a general increase in infection level, we know that P, like all probabilities, can only fall within the boundaries of 0 and 1. Consequently, it is better to assume that the relationship between X1 and P is sigmoidal (S-shaped), rather than a straight line.

It is possible, however, to find a linear relationship between X1 and a function of P. Although a number of functions work, one of the most useful is the logit function. It is the natural log of the odds that Y is equal to 1, which is simply the ratio of the probability that Y is 1 divided by the probability that Y is 0. The relationship between the logit of P and P itself is sigmoidal in shape. The regression equation that results is:

ln[P/(1-P)] = B0 + B1*X1 + B2*X2 + …

Although the left side of this equation looks intimidating, this way of expressing the probability results in the right side of the equation being linear and looking familiar to us. This helps us understand the meaning of the regression coefficients. The coefficients can easily be transformed so that their interpretation makes sense.

The logistic regression equation can be extended beyond the case of a dichotomous response variable to the cases of ordered categories and polytymous categories (more than two categories).