Variable Formats in SPSS Syntax

October 21st, 2010 by Karen Grace-Martin

One place where SPSS syntax is especially efficient is in creating new variables. This is especially true when you’re creating a LOT of new variables, but even one or two can be quicker if you write the syntax code instead of using the menus.

And just as importantly, you’ll have documentation for exactly how you created them. (You think you’ll remember now, but 75 new variables later, you’ll thank me).

So once you create a new variable, you should of course immediately assign a Variable Label, and if appropriate, Value Labels and Missing Data Codes using Syntax.

Another thing that helps keep your new variable clean and interpretable is to assign the format. The default format is F8.2, which indicates a numeric value 8 characters wide with 2 decimal places.

You could go into the Variable View screen and manually change the Width and Decimals columns, which set the total width of the variable and (for numeric variables) the number of digits after the decimal point.

But why do all that when you can just use a single command to define multiple variables?

The syntax command is FORMATS. Here is the command for some common formats:

FORMATS NumVar1 NumVar2 (F5.0)
/NumVar3 (F6.1)
/StringVar1 (A15).

You can see the FORMATS command is followed by the variable names, then the format in parentheses.

Numeric variables NumVar1 and NumVar2 will both get the same format: 5 digits, with nothing after the decimal.

Numeric variable NumVar3 will be 6 characters wide in total (the decimal point counts as one), with one digit after the decimal.

And string variable (i.e. its values can contain letters) StringVar1 is 15 characters wide.
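
Putting the pieces together, here is a minimal sketch of creating one new variable and documenting it right away. The variable AgeGroup and the cutoff of 40 are made up for illustration, and the sketch assumes a numeric variable named Age already exists in the data set.

```spss
* Hypothetical example: AgeGroup and the cutoff of 40 are for illustration only.
* The logical expression returns 1 if true, 0 if false, and missing if Age is missing.
COMPUTE AgeGroup = (Age >= 40).
VARIABLE LABELS AgeGroup 'Age group (under 40 vs. 40 and over)'.
VALUE LABELS AgeGroup 0 'Under 40' 1 '40 and over'.
FORMATS AgeGroup (F1.0).
EXECUTE.
```

Since AgeGroup only takes the values 0 and 1, F1.0 keeps it from displaying as 1.00 in the Data Editor and in output.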

This will get you started, but you can get all the specifics in the FORMATS section of the Command Syntax Reference, which is included in the SPSS help.

[Note: Edited explanation of F6.1 to be 6 digits total, not 6 digits before the decimal.]

Using Adjusted Means to Interpret Moderators in Analysis of Covariance

September 24th, 2010 by Karen Grace-Martin

If you’re like most researchers, your statistical training focused on Regression or ANOVA, but not both. It all depends on whether your field focuses more on experimental data (Biology, Psychology) or observed data (Sociology, Economics). Maybe one class covered a bit of the other, but most people are comfortable in one, but not the other.

This, in my opinion, is a shame. (Okay, I was going to say tragedy, but let’s be real. Tsunami that kills thousands=tragedy. Different scale here).

First of all, the distinction between ANOVA and linear regression is arbitrary. They’re really the same model with different outfits on.

Second, regardless of which one you normally use, you’re going to occasionally have to use the other kind of predictor variables–categorical or continuous. And we can come up with nice names for these models–a regression with dummy variables or an Analysis of Covariance.

But real understanding of the relationships among variables comes only when you dispense with the names and can focus on analyzing and interpreting the model using the kinds of variables you have.

There are other examples, but today I’m going to focus on an ANOVA model with a continuous covariate.

A common model is one in which one predictor is categorical (we’ll use 4 categories) and the other is continuous. Here is an example of a scatterplot of just such a model:

Scatterplot of Ancova

There are four groups, each of which received a different training. The continuous moderator is Age, and the outcome is OverallPost, which is the post-training test score to see how well they learned the material in each training program.

As you can see, the effect of the training program is moderated by age. Another way to say that is there is a significant interaction between Age and Training Group. The effect of the training depends on the trainee’s age.

One way to interpret this significant interaction is to compare the slopes of the four lines, which is easily done with any regression coefficient table. (Okay, not always easily done, but easily found in…)

But this doesn’t make very much sense when Age is really a moderator: a predictor we want to control for so we can see how it affects the relationship between the independent variable (IV) and the dependent variable (DV), but not the IV we’re really interested in.

A better way to do it in this situation is to compare the means among groups at a low value of Age, say 20, and again at a high value of Age, say 50. You can get p-values, adjusted for multiple comparisons, using either SAS or SPSS GLM.

SAS Proc GLM uses the LSMeans statement and SPSS GLM uses EMMeans. They do the same thing–calculate the mean of Y for each group, at a specific value of the covariate.

If you use the menus in SPSS, you can only get those EMMeans at the Covariate’s mean, which in this example is about 25, where the vertical black line is. This isn’t very useful for our purposes. But we can change the value of the covariate at which to compare the means using syntax.

The output would tell us that at a young age of, say, 20, the three treatment groups (green, tan, and purple lines) all have means higher than the control (blue). Young people learned more in all three treatment groups.

But at an older age, say 50, the means of the purple and tan groups were not significantly different from the control group’s (blue), and the green (EIQ group) did worse!

In SPSS GLM, the syntax would be:

UNIANOVA OverallPost BY group WITH NEWAGE
/METHOD=SSTYPE(3)
/INTERCEPT=INCLUDE
/EMMEANS=TABLES(group) WITH(NEWAGE=MEAN) COMPARE ADJ(SIDAK)
/EMMEANS=TABLES(group) WITH(NEWAGE=45) COMPARE ADJ(SIDAK)
/EMMEANS=TABLES(group) WITH(NEWAGE=20) COMPARE ADJ(SIDAK)
/PRINT=PARAMETER
/CRITERIA=ALPHA(.05)
/DESIGN=NEWAGE group NEWAGE*group.
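
Note that the syntax above requests the comparisons at the covariate mean, at 45, and at 20, while the text describes the older comparison at 50. As a stripped-down sketch, comparing the groups at exactly 50 would only require changing the WITH value. This version uses the same variable names and leaves the other subcommands at their defaults, which is an assumption about what you need.

```spss
* Sketch only: compares the adjusted group means at NEWAGE = 50 instead of 45.
UNIANOVA OverallPost BY group WITH NEWAGE
  /EMMEANS=TABLES(group) WITH(NEWAGE=50) COMPARE ADJ(SIDAK)
  /DESIGN=NEWAGE group NEWAGE*group.
```

Each EMMEANS subcommand produces its own table of estimated means and pairwise comparisons, so you can list as many covariate values as you want to examine.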

The General Linear Model, Analysis of Covariance, and How ANOVA and Linear Regression Really are the Same Model Wearing Different Clothes

September 17th, 2010 by Karen Grace-Martin

Just recently, a client got some feedback from a committee member that the Analysis of Covariance (ANCOVA) model she ran did not meet all the assumptions.

Specifically, the assumption in question is that the covariate has to be uncorrelated with the independent variable.

This committee member is, in the strictest sense of how analysis of covariance is used, correct.

And yet, they over-applied that assumption to an inappropriate situation.

ANCOVA for Experimental Data

Analysis of Covariance was developed for experimental situations and some of the assumptions and definitions of ANCOVA apply only to those experimental situations.

The key feature is that the independent variables are categorical and manipulated, not observed.

The covariate–continuous and observed–is considered a nuisance variable. There are no research questions about how this covariate itself affects or relates to the dependent variable.

The only hypothesis tests of interest are about the independent variables, controlling for the effects of the nuisance covariate.

A typical example is a study to compare the end-of-year math scores of students who were enrolled in three different learning programs.

The key independent variable here is the learning program. Students need to be randomly assigned to one of the three programs.

The only research question is about whether the math scores differed on average among the three programs. It is useful to control for a covariate like IQ scores, but we are not really interested in the relationship between IQ and math scores.

So in this example, in order to conclude that the learning program affected math scores, it is indeed important that IQ score, the covariate, is unrelated to which learning program the students were assigned to.

You could not make that causal interpretation if it turns out that the IQ scores were generally higher in one learning program than the others.

So this assumption of ANCOVA is very important in this specific type of study in which we are trying to make a specific type of inference.

ANCOVA for Other Data

But that’s really just one application of a linear model with one categorical and one continuous predictor. The research question of interest doesn’t have to be about the causal effect of the categorical predictor, and the covariate doesn’t have to be a nuisance variable.

A regression model with one continuous and one dummy-coded variable is the same model (actually, you’d need two dummy variables to cover the three categories, but that’s another story).

The focus of that model may differ–perhaps the main research question is about the continuous predictor.

But it’s the same mathematical model.

The software will run it the same way. YOU may focus on different parts of the output or select different options, but it’s the same model.
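
To see this for yourself, here is a sketch that fits the learning-program example both ways. The variable names MathScore, Program (coded 1, 2, 3), and IQ are hypothetical. The dummy variables use Program 3 as the reference category because GLM treats the last category as the reference by default.

```spss
* Hypothetical variables: MathScore (outcome), Program (coded 1, 2, 3), IQ (covariate).
* Version 1: the model run as an ANCOVA through GLM.
UNIANOVA MathScore BY Program WITH IQ
  /PRINT=PARAMETER
  /DESIGN=IQ Program.

* Version 2: the same model run as a regression with two dummy variables.
COMPUTE Prog1 = (Program = 1).
COMPUTE Prog2 = (Program = 2).
EXECUTE.
REGRESSION
  /DEPENDENT MathScore
  /METHOD=ENTER IQ Prog1 Prog2.
```

The parameter estimates table from the first run and the coefficients table from the second should show the same intercept, the same IQ slope, and the same two program coefficients.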

And that’s where the model names can get in the way of understanding the relationships among your variables. The model itself doesn’t care if the categorical variable was manipulated. It doesn’t care if the categorical independent variable and the continuous covariate are mildly correlated.

If those ANCOVA assumptions aren’t met, it does not change the analysis at all. It only affects how parameter estimates are interpreted and the kinds of conclusions you can draw.

In fact, those assumptions really aren’t about the model. They’re about the design. It’s the design that affects the conclusions. It doesn’t matter to the model if a covariate is a nuisance variable or an interesting phenomenon. That’s a design issue.

The General Linear Model

So what do you do instead of labeling models? Just call them a General Linear Model. It’s hard to think of regression and ANOVA as the same model because the equations look so different. But it turns out they aren’t so different after all.

Regression and ANOVA model equations
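
In case the image doesn’t come through, here is a sketch of the two equations in their usual textbook form. The exact notation of the original figure isn’t recoverable, so treat the details as an assumption: a regression with two predictors, and a two-way ANOVA with an interaction (chosen because the discussion refers to j and k subscripts).

```latex
% Regression with two predictors
Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i

% Two-way ANOVA with interaction
Y_{ijk} = \mu + \alpha_j + \beta_k + (\alpha\beta)_{jk} + \varepsilon_{ijk}
```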

If you look at the two models, first you may notice some similarities.

Both are modeling Y, an outcome.
Both have a “fixed” portion on the right with some parameters to estimate–this portion estimates the mean values of Y at the different values of X.
Both equations have a residual, which is the random part of the model. It is the variation in Y that is not explained by the Xs.

But wait a minute, Karen, are you nuts?–there are no Xs in the ANOVA model!

Actually, there are. They’re just implicit.

Since the Xs are categorical, they have only a few values, to indicate which category a case is in. Those j and k subscripts? They’re really just indicating the values of X.

(And for the record, I think a couple Xs are a lot easier to keep track of than all those subscripts. Ever have to calculate an ANOVA model by hand? Just sayin’.)

So instead of trying to come up with the right label for a model, focus instead on understanding (and describing in your paper) the measurement scales of your variables, if and how much they’re related, and how that affects the conclusions.

In my client’s situation, it was not a problem that the continuous and the categorical variables were mildly correlated. The data were not experimental and she was not trying to draw causal conclusions about only the categorical predictor.

So she had to call this ANCOVA model a multiple regression.

Cross-tabulation in Cohort and Case-Control Studies

September 3rd, 2010 by guest contributor

by Annette Gerritsen, Ph.D.

Cross-tabulation in cohort studies

Assume you have just done a cohort study. How do you actually do the cross-tabulation to calculate the cumulative incidence in both groups?

Best is to always put the outcome variable (disease yes/no) in the columns and the exposure variable in the rows. In other words, put the dependent variable–the one that describes the problem under study–in the columns. And put the independent variable–the factor assumed to cause the problem–in the rows.

Let’s take as an example a cohort study used to see whether there is a causal relationship between the use of a certain water source and the incidence of diarrhea among children under five in a village with different water sources. In this case, the variable diarrhea (yes/no) should be in the columns. The variable water source (suspected/other) should be in the rows.

SPSS will put the lowest value of the variable in the first column or row. So in order to get those with diarrhea in the first column you should label ‘diarrhea’ as 1 and ‘no diarrhea’ as 2. The same is true for the exposure variable: label the ‘suspected water source’ as 1 and the ‘other water source’ as 2.

You will then be able to calculate the cumulative incidence (risk of developing the disease) among those with the exposure: a / (a + b) and among those without the exposure: c / (c + d).

In the case of the diarrhea study (Table 1), you could calculate the cumulative incidence of diarrhea among those exposed to the suspected water source, which would be (78 / 1,500 =) 5.2%.

You can also do this for those exposed to other water sources, which would be (50 / 1,000 =) 5.0%.

SPSS can give you these percentages immediately (in cells ‘a’ and ‘c’ respectively), when you ask to display row percentages in the Cells option (Table 2).
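
As a minimal sketch, the cohort cross-tabulation with row percentages could be requested with syntax along these lines. The variable names WaterSource and Diarrhea are hypothetical; the 1/2 coding follows the scheme described above.

```spss
* Hypothetical variable names, coded 1 and 2 as described in the text.
VALUE LABELS WaterSource 1 'Suspected water source' 2 'Other water source'
  /Diarrhea 1 'Diarrhea' 2 'No diarrhea'.

* Exposure in the rows, outcome in the columns, with row percentages.
CROSSTABS
  /TABLES=WaterSource BY Diarrhea
  /CELLS=COUNT ROW.
```

With the counts from the diarrhea example (a = 78, b = 1,422, c = 50, d = 950), the row percentages in the first column are 78 / 1,500 = 5.2% and 50 / 1,000 = 5.0%.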

Cross-tabulation in Case-Control Studies

If you had used a case-control design for the diarrhea study, the actual cross-tabulation would be quite similar; the only change is that “presence of diarrhea yes/no” becomes “cases” and “controls.”

Label the cases as 1, and the controls as 2. Be aware that row percentages have no meaning in terms of occurrence of disease in case-control studies. This is because in case-control studies the researcher determines how many patients and how many controls are included.

The ratio between the number of patients and controls (e.g. 2 : 1 or 4 : 1) influences the row percentages. So in a case-control study, the cumulative incidence cannot be calculated.

After conducting a case-control study, you can ask to display column percentages. That gives you the proportion of those exposed to the suspected water source among the cases (in cell ‘a’) and among the controls (in cell ‘b’).

Table 3 gives the SPSS output for the same diarrhea study assuming that it had a case-control design. Using the data provided, (78 / 128 =) 60.9% of the cases were exposed to the suspected water source, while this was (1,422 / 2,372 =) 59.9% of the controls (asked for column percentages).
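
A sketch of the matching syntax for the case-control version, again with hypothetical variable names. The only changes from the cohort syntax are the column variable and the switch from row to column percentages.

```spss
* Hypothetical variable names: CaseControl is coded 1 = case, 2 = control.
CROSSTABS
  /TABLES=WaterSource BY CaseControl
  /CELLS=COUNT COLUMN.
```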

Another article will be devoted to measures of association: How do you actually compare cumulative incidence rates in cohort studies? And what measure of association can be used in case-control studies?

About the Author: With expertise in epidemiology, biostatistics and quantitative research projects, Annette Gerritsen, Ph.D. provides services to her clients focussing on the methodological soundness of each phase of an epidemiological study to ensure getting valid answers to the proposed research questions. She is the founder of Epi Result.

The Data Analysis Work Flow: 9 Strategies for Keeping Track of your Analyses and Output

August 13th, 2010 by Karen Grace-Martin

Knowing the right statistical analysis to use in any data situation, knowing how to run it, and being able to understand the output are all really important skills for statistical analysis. Really important.

But they’re not the only ones.

Another is having a system in place to keep track of the analyses. This is especially important if you have any collaborators (or a statistical consultant!) you’ll be sharing your results with. You may already have an effective work flow, but if you don’t, here are some strategies I use. I hope they’re helpful to you.

1. Always use Syntax Code

All the statistical software packages have come up with some sort of easy-to-use, menu-based approach. And as long as you know what you’re doing, there is nothing wrong with using the menus. While I’m familiar enough with SAS code to just write it, I use menus all the time in SPSS.

But even if you use the menus, paste the syntax for everything you do. There are many reasons for using syntax, but the main one is documentation. Whether you need to communicate to someone else or just remember what you did, syntax is the only way to keep track. (And even though, in the midst of analyses, you believe you’ll remember how you did something, a week and 40 models later, I promise you won’t. I’ve been there too many times. And it really hurts when you can’t replicate something).

In SPSS, there are two things you can do to make this seamlessly easy. First, instead of hitting OK, hit Paste. Second, make sure syntax shows up on the output. This is the default in later versions, but you can turn it on in Edit–>Options–>Viewer. Make sure “Display Commands in Log” and “Log” are both checked. (Note: the menus may differ slightly across versions).

2. If your data set is large, create smaller data sets that are relevant to each set of analyses.

First, all statistical software needs to read the entire data set to do many analyses and data manipulation. Since that same software is often a memory hog, running anything on a large data set will s-l-o-w down processing. A lot.

Second, it’s just clutter. It’s harder to find the variables you need if you have an extra 400 variables in the data set.
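
A minimal sketch of creating an analysis-specific subset in SPSS. The file paths and the list of variables kept are hypothetical.

```spss
* Hypothetical paths and variable list.
GET FILE='C:\Projects\Training\training_full.sav'.
SAVE OUTFILE='C:\Projects\Training\training_ancova.sav'
  /KEEP=ID group NEWAGE OverallPost.
```

The original file is left untouched; only the listed variables are written to the new, smaller file.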

3. Instead of just opening a data set manually, use commands in your syntax code to open data sets.

Why? Unless you are committing the cardinal sin of overwriting your original data as you create new variables, you have multiple versions of your data set. Having the data set listed right at the top of the analysis commands makes it crystal clear which version of the data you analyzed.
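
In SPSS, that can be as simple as the following sketch at the top of each syntax file. The path and the data set name are hypothetical.

```spss
* Hypothetical path: names the exact version of the data being analyzed.
GET FILE='C:\Projects\Training\training_ancova.sav'.
DATASET NAME ancova.
```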

4. Use Variable and Value labels religiously

I know you remember today that your variable named Mar4cat means marital status in 4 categories and that 0 indicates ‘never married.’ It’s so logical, right? Well, it’s not obvious to your collaborators and it won’t be obvious to you in two years, when you try to re-analyze the data after a reviewer doesn’t like your approach.

Even if you have a separate code book, why not put it right in the data? It makes the output so much easier to read, and you don’t have to worry about losing the code book. It may feel like more work upfront, but it will save time in the long run.
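
For the Mar4cat example, the labels might look like this sketch. Only the code 0 for ‘never married’ comes from the example above; the other three categories are invented for illustration.

```spss
* Only the 0 = never married code comes from the example; the other codes are made up.
VARIABLE LABELS Mar4cat 'Marital status in 4 categories'.
VALUE LABELS Mar4cat
  0 'Never married'
  1 'Married'
  2 'Divorced or separated'
  3 'Widowed'.
```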

5. Put data manipulation, descriptive analyses, and models in separate syntax files

When I do data analysis, I follow my Steps approach, which means first I create all the relevant variables, then run univariate and bivariate statistics, then initial models, and finally hone the models.

And I’ve found that if I keep each of these steps in separate program files, it makes it much easier to keep track of everything. If you’re creating new variables in the middle of analyses, it’s going to be harder to find the code so you can remember exactly how you created that variable.

6. As you run different versions of models, label them with model numbers

When you’re building models, you’ll often have a progression of different versions. Especially when I have to communicate with a collaborator, I’ve found it invaluable to number these models in my code and print that model number on the output. It makes a huge difference in keeping track of nine different models.

7. As you go along with different analyses, keep your syntax clean, even if the output is a mess.

Data analysis is a bit of an iterative process. You try something, discover errors, realize that variable didn’t work, and try something else. Yes, base it on theory and have a clear analysis plan, but even so, the first analyses you run won’t be your last.

Especially if you make mistakes as you go along (as I inevitably do), your output gets pretty littered with output you don’t want to keep. You could clean it up as you go along, but I find that’s inefficient. Instead, I try to keep my code clean, with only the error-free analyses that I ultimately want to use. It lets me try whatever I need to without worry. Then at the end, I delete the entire output and just rerun all code.

One caveat here: You may not want to take this approach if you have VERY computing-intensive analyses, like a generalized linear mixed model with crossed random effects on a large data set. If your code takes more than 20 minutes to run, this won’t be more efficient.

8. Use titles and comments liberally

I’m sure you’ve heard before that you should use lots of comments in your syntax code. But use titles too. Both SAS and SPSS have title commands that allow titles to be printed right on the output. This is especially helpful for naming and numbering all those models in #6.
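
Here is a sketch of how a numbered model might look in SPSS, with both a comment and a title. The model number and the wording are hypothetical; the variable names are borrowed from the ANCOVA example in an earlier post.

```spss
* Model 3: adds the Age by group interaction to Model 2.
TITLE 'Model 3: Age, group, and Age by group interaction'.
UNIANOVA OverallPost BY group WITH NEWAGE
  /DESIGN=NEWAGE group NEWAGE*group.
```

The comment documents the model in the syntax file, while the TITLE text is printed in the output itself, so the two stay matched.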

9. Name output, log, and programs the same

Since you’ve split your programs into separate files for data manipulations, descriptives, initial models, etc., you’re going to end up with a lot of files. What I do is give each output the same name as the program file. (And if I’m in SAS, the log too. Yes, save the log.)

Yes, that means making sure you have a separate output for each section. While it may seem like extra work, it can make looking at each output less overwhelming for anyone you’re sharing it with.

On Data Integrity and Cleaning

July 30th, 2010 by Karen Grace-Martin

This year I hired a Quickbooks consultant to bring my bookkeeping up from the stone age. (I had been using Excel).

She had asked for some documents with detailed data, and I tried to send her something else as a shortcut. I thought it was detailed enough. It wasn’t, so she just fudged it. The bottom line was all correct, but the data that put it together was all wrong.

I hit the roof. Internally only, because I realized it was my own fault for not giving her the info she needed. She did a fabulous job.

But I could not leave the data fudged, even if it all added up to the right amount, and already reconciled. I had to go in and spend hours fixing it. Truthfully, I was a bit of a compulsive nut about it.

And then I had to ask myself why I was so uptight—if accountants think the details aren’t important, why do I? Statisticians are all about approximations and accountants are exact, right?

As it turns out, not so much.

But I realized I’ve had 20 years of training about the importance of data integrity. Sure, the results might be inexact, the analysis, the estimates, the conclusions. But not the data. The data must be clean.

Sparkling, if possible.

In research, it’s okay if the bottom line is an approximation. Because we’re never really measuring the whole population. And we can’t always measure precisely what we want to measure. But in the long run, it all averages out.

But only if the measurements we do have are as accurate as they possibly can be.
