Karen Grace-Martin

The Second Problem with Mean Imputation

October 2nd, 2008 by Karen Grace-Martin

A previous post discussed the first reason not to use mean imputation as a way of dealing with missing data: it does not preserve the relationships among variables.

A second reason is that any type of single imputation underestimates the error variance of any statistic that uses the imputed data.  Because the imputations are themselves estimates, there is some error associated with them.  But your statistical software doesn’t know that.  It treats the imputed values as real data.

Because your standard errors are too low, so are your p-values.  Now you’re making Type I errors without realizing it.

A better approach?  Multiple Imputation or Full Information Maximum Likelihood.
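A quick simulation makes the problem concrete.  This is just a sketch using NumPy; the sample size, the 30% missingness rate, and the distribution are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a complete variable, then delete 30% of it at random.
x = rng.normal(loc=50, scale=10, size=1000)
missing = rng.random(1000) < 0.30
observed = x[~missing]

# Mean imputation: every missing value becomes the observed mean.
imputed = x.copy()
imputed[missing] = observed.mean()

# The filled-in constant contributes no spread of its own, so the
# standard deviation -- and every standard error built on it -- shrinks.
print(f"SD of observed values:    {observed.std(ddof=1):.2f}")
print(f"SD after mean imputation: {imputed.std(ddof=1):.2f}")
```

Since the standard error of the mean is the standard deviation divided by the square root of n, the understated spread flows straight through to understated standard errors and overstated significance.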


Introduction to Logistic Regression

September 26th, 2008 by Karen Grace-Martin

Researchers are often interested in setting up a model to analyze the relationship between some predictors (i.e., independent variables) and a response (i.e., dependent variable). Linear regression is commonly used when the response variable is continuous.  One assumption of linear models is that the residual errors follow a normal distribution. This assumption fails when the response variable is categorical, so an ordinary linear model is not appropriate. This article presents a regression model for a response variable that is dichotomous–having two categories. Examples are common: whether a plant lives or dies, whether a survey respondent agrees or disagrees with a statement, or whether an at-risk child graduates from or drops out of high school.

In ordinary linear regression, the response variable (Y) is a linear function of the coefficients (B0, B1, etc.) that correspond to the predictor variables (X1, X2, etc.). A typical model would look like:

Y = B0 + B1*X1 + B2*X2 + B3*X3 + … + E

For a dichotomous response variable, we could set up a similar linear model to predict individuals’ category memberships if numerical values are used to represent the two categories. Arbitrary values of 1 and 0 are chosen for mathematical convenience. Using the first example, we would assign Y = 1 if a plant lives and Y = 0 if a plant dies.

This linear model does not work well for a few reasons. First, the response values, 0 and 1, are arbitrary, so modeling the actual values of Y is not of direct interest. Second, it is really the probability that each individual in the population responds with 0 or 1 that we are interested in modeling. For example, we may find that plants with a high level of a fungal infection (X1) fall into the category “the plant lives” (Y) less often than plants with a low level of infection. Thus, as the level of infection rises, the probability of a plant living decreases.

Thus, we might consider modeling P, the probability, as the response variable. Again, there are problems. Although P generally decreases as the infection level increases, we know that P, like all probabilities, can only fall within the boundaries of 0 and 1. Consequently, it is better to assume that the relationship between X1 and P is sigmoidal (S-shaped), rather than a straight line.

It is possible, however, to find a linear relationship between X1 and a function of P. Although a number of functions work, one of the most useful is the logit function. It is the natural log of the odds that Y equals 1: the probability that Y is 1 divided by the probability that Y is 0. The relationship between P and the predictors is sigmoidal in shape, while the relationship between the logit of P and the predictors is linear. The regression equation that results is:

ln[P/(1-P)] = B0 + B1*X1 + B2*X2 + …

Although the left side of this equation looks intimidating, expressing the probability this way makes the right side of the equation linear and familiar to us. This helps us understand the meaning of the regression coefficients. The coefficients can easily be transformed (exponentiating a coefficient turns it into an odds ratio) so that their interpretation makes sense.
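As a small sketch in plain NumPy, solving the equation above for P recovers the sigmoid curve, and exponentiating a coefficient gives its odds ratio.  The coefficient values B0 = -2 and B1 = 0.8 here are made up for illustration:

```python
import numpy as np

def logit(p):
    """Natural log of the odds that Y = 1."""
    return np.log(p / (1 - p))

def inv_logit(z):
    """Solve ln[P/(1-P)] = z for P; this is the sigmoid curve."""
    return 1 / (1 + np.exp(-z))

# Hypothetical coefficients for a single predictor X1.
b0, b1 = -2.0, 0.8
x1 = np.linspace(-5, 5, 11)

# Predicted probabilities stay strictly between 0 and 1,
# tracing an S-shaped curve as X1 increases.
p = inv_logit(b0 + b1 * x1)
print(np.round(p, 3))

# exp(B1) is the odds ratio: the odds that Y = 1 are multiplied
# by about 2.23 for each one-unit increase in X1.
print(np.exp(b1))
```

The same two functions work in both directions: `logit` maps a probability onto the linear scale of the regression equation, and `inv_logit` maps a fitted linear predictor back onto the probability scale.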

The logistic regression equation can be extended beyond the case of a dichotomous response variable to the cases of ordered categories and polytomous categories (more than two categories).

 


A Reason to Not Drop Outliers

September 23rd, 2008 by Karen Grace-Martin

I recently had this question in consulting:

I’ve got 12 out of 645 cases with Mahalanobis’s Distances above the critical value, so I removed them and reran the analysis, only to find that another 10 cases were now outside the value. I removed these, and another 10 appeared, and so on until I have removed over 100 cases from my analysis! Surely this can’t be right!?! Do you know any way around this? It is really slowing down my analysis and I have no idea how to sort this out!!

And this was my response:

I wrote an article about dropping outliers.  As you’ll see, you can’t just drop outliers without a REALLY good reason.  Being influential is not in itself a good enough reason to drop data.
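The endless stream of new outliers is not a fluke.  Mahalanobis distances are computed relative to the sample mean and covariance, so once the flagged cases are removed and both are re-estimated, the data cloud tightens and new cases cross the same cutoff.  Here is a sketch of that cycle using NumPy and SciPy, with simulated multivariate normal data (so there are no "true" outliers at all) and an arbitrary 97.5% chi-square cutoff:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(42)
p = 5                                            # number of variables
data = rng.multivariate_normal(np.zeros(p), np.eye(p), size=645)
cutoff = chi2.ppf(0.975, df=p)                   # critical value for squared distances

def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each row from the sample centroid."""
    centered = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    return np.einsum("ij,jk,ik->i", centered, inv_cov, centered)

# Iteratively drop everything past the cutoff, as in the question above.
# The mean and covariance are re-estimated each round, so fresh cases
# keep crossing the (re-scaled) threshold.
X = data
for step in range(5):
    flags = mahalanobis_sq(X) > cutoff
    print(f"round {step + 1}: {flags.sum()} cases flagged out of {len(X)}")
    if not flags.any():
        break
    X = X[~flags]
```

Since every round flags the most extreme cases of whatever data remain, the procedure can keep trimming indefinitely even when the data are perfectly well behaved, which is exactly what the questioner observed.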

 


Outliers: To Drop or Not to Drop

September 17th, 2008 by Karen Grace-Martin

Should you drop outliers? Outliers are one of those statistical issues that everyone knows about, but most people aren’t sure how to deal with.  Most parametric statistics, like means, standard deviations, and correlations, and every statistic based on these, are highly sensitive to outliers.

And since the assumptions of common statistical procedures, like linear regression and ANOVA, are also based on these statistics, outliers can really mess up your analysis.

Despite all this, as much as you’d like to, it is NOT acceptable to drop an observation just because it is an outlier.  Outliers can be legitimate observations and are sometimes the most interesting ones.  It’s important to investigate the nature of the outlier before deciding.

  1. If it is obvious that the outlier is due to incorrectly entered or measured data, you should drop the outlier:

    For example, I once analyzed a data set in which a woman’s weight was recorded as 19 lbs.  I knew that was physically impossible.  Her true weight was probably 91, 119, or 190 lbs, but since I didn’t know which one, I dropped the outlier.

    This also applies to a situation in which you know the datum did not accurately measure what you intended.  For example, if you are testing people’s reaction times to an event, but you saw that a participant was not paying attention and was randomly hitting the response key, you know the measurement is not accurate.

  2. If the outlier does not change the results but does affect assumptions, you may drop the outlier.  But note the dropped observation in a footnote of your paper.

    Neither the presence nor absence of the outlier in the graph below would change the regression line:

    [Graph 1: the regression line with and without the outlier]

  3. More commonly, the outlier affects both results and assumptions.  In this situation, it is not legitimate to simply drop the outlier.  You may run the analysis both with and without it, but you should state, at least in a footnote, which data points you dropped and how the results changed.

    [Graph 2: an outlier that affects both results and assumptions]

  4. If the outlier creates a strong association, you should drop the outlier and should not report any association from your analysis.

    In the following graph, the relationship between X and Y is clearly created by the outlier.  Without it, there is no relationship between X and Y, so the regression coefficient does not truly describe the effect of X on Y.

    [Graph 3: an association created entirely by the outlier]

So in those cases where you shouldn’t drop the outlier, what do you do?

One option is to try a transformation.  Square root and log transformations both pull in high numbers.  This can make assumptions work better if the outlier is a dependent variable and can reduce the impact of a single point if the outlier is an independent variable.
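As a small illustration in NumPy, with a made-up sample containing one extreme value (note all values must be positive to take logs), both transformations pull the outlier in toward the rest of the data, the log more aggressively than the square root:

```python
import numpy as np

# Nine ordinary values plus one extreme outlier.
y = np.array([2, 3, 3, 4, 5, 5, 6, 7, 8, 1000], dtype=float)

# How far the maximum sits above the middle of the data,
# before and after transforming.
for name, v in [("raw", y), ("sqrt", np.sqrt(y)), ("log", np.log(y))]:
    print(f"{name:>4}: max / median = {v.max() / np.median(v):.1f}")
```

On the raw scale the maximum is 200 times the median; the square root and log scales shrink that ratio dramatically, which is why these transformations can rescue assumptions or tame an influential predictor value.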

Another option is to try a different model.  This should be done with caution, but it may be that a non-linear model fits better.  For example, in example 3, perhaps an exponential curve fits the data with the outlier intact.

Whichever approach you take, you need to know your data and your research area well.  Try different approaches, and see which make theoretical sense.

 


Multiple Imputation Resources

September 15th, 2008 by Karen Grace-Martin

Two excellent resources about multiple imputation and missing data:

Joe Schafer’s Multiple Imputation FAQ Page gives more detail about multiple imputation, including a list of references.

Paul Allison’s 2001 book Missing Data is the most readable book on the topic. It gives in-depth information on many good approaches to missing data, including multiple imputation. It is aimed at social science researchers, and best of all, it is very affordable (about $15).

 


The Statistics Myth: Why Statistics Seems so Hard to Learn

August 31st, 2008 by Karen Grace-Martin

There are probably many myths about statistics, but there is one that I believe leads to the most frustration in researchers (and students) as they attempt to learn and apply statistics.

The Carpentry Class: A Fable

There was once a man who needed to build a house. He had a big pile of lumber, and he needed a place to live, so building one himself seemed like a good idea.

He realized that he did not have the knowledge and many skills needed to build a house.

So he did what any intelligent, well-educated person would do. He took a course: House Building 101.

There was a lot of new jargon: trusses, plumb walls, 16” on center, cripple studs. It was hard to keep it all straight. It didn’t make sense. Why would anyone ever need a header anyway?

But he made it through with a B+. He learned the basics. The doghouse he built in the lab was pretty straight. He even took another course to make sure he knew enough: Advanced Carpentry.

It was time for the man to build his house. He had his land, his plan, his tools, his sacks of concrete, windows, lumber, and nails.

The first day he started with enthusiasm. He swung his hammer with gusto and nailed his first wall into place. It felt good.

But wait. His house was being built on a hill. The textbook only had flat land. How should he deal with hills?

And this house has a bay window. His doghouse had only double hung windows. Doesn’t a bay window stick out?

And he was not sure which technique to use to make that 145 degree angle in the hall. The courses never mentioned anything but 90 degree angles.

In class, they used circular saws. In order to install the trim he ordered, he needed to use a chop saw and a table saw.

He didn’t realize he was supposed to put in the plumbing before the electric, so he ended up doing a LOT of rewiring when the plumbing wouldn’t fit around his wires.

Even with the plans in front of him, there were so many decisions to make, so many new skills to learn.

And he was supposed to move into the house in 4 months when his lease ran out. He’d never get it done in time. Not on his own.

He sounds like a fool, doesn’t he? No one could build a house after taking even a few courses. Especially not with a deadline.

Building a house requires the knowledge of how walls are constructed, sure. But it also requires the ability to use the tools, and the practical skills to implement the techniques.

We can see that this project was a silly one to tackle, yet all the time we think it’s our fault that we have trouble with statistical analysis after taking a few classes.

The Statistics Myth:

Having knowledge about statistics is the only thing necessary to practice statistics.

This isn’t true.

And it’s not helpful.

Yes, the knowledge is necessary, but it is not sufficient.

Statistics doesn’t make sense to students because it is taught out of context. Most people don’t really learn statistics until they start analyzing data in their own research. Yes, it makes those classes tough. You need to acquire the knowledge before you can truly understand it.

The only way to learn how to build a house is to build one. The only way to learn how to analyze data is to analyze some.

Here’s the thing. Data analysts (and house builders) need practical support as they learn. Yes, both could slug it out on their own, but it takes longer, is more frustrating, and leads to many more mistakes.

None of this is necessary. There can be a happy ending.

Carpenters work alongside a master to learn their craft. I have never heard of a statistician or a thesis advisor who sits next to a novice analyzing data. (Anyone who had an advisor like that should consider themselves lucky.) Unlike a novice carpenter, a novice data analyst is of no help to a master. They can’t even hold the ladder.

More common are advisors who tell their students which statistics classes to take (again, if they’re lucky) then send them off to analyze data. The student can ask questions as they go along if they are not too afraid to admit what they don’t know.  And if their advisor is available. And knows the answer.

Really good advisors are not too busy to answer in a timely manner and are willing to admit it if they don’t know the answer.

But most data analysts feel a bit lost. Not just new ones—many experienced researchers never really learned statistical practice very well in the first place. Nearly all researchers face new statistical challenges as their research progresses, and it’s often difficult to find someone knowledgeable enough who is willing and able to explain it.

They are not lost because they are stupid.

They are not lost because statistics is beyond their capabilities.

They are not lost because they didn’t do well in their statistics classes.

They are lost because like carpentry, statistical analysis is an applied skill, a craft.

Acquiring the background knowledge is only one essential part of mastering a craft.

The others are real experience and a mentor to coach you.

Think about it.  How many skills (dancing, sailing, teaching) have you acquired in your life by only taking a class that gave you background knowledge, but no real experience and no real mentor to coach you?

So if you’re stuck on something in statistics, give yourself a break.  You can do this with the right support.

Everything we do at The Analysis Factor is to help you get unstuck.  If you’re frustrated, tired, or even scared…there is another way.

 

If you need help right now, we’ve got your back. Please check out our Statistical Consulting services and our Statistically Speaking membership.