Karen Grace-Martin

Series on Confusing Statistical Terms

December 3rd, 2009 by

One of the biggest challenges in learning statistics and data analysis is learning the lingo.  It doesn’t help that half of the notation is in Greek (literally).

The terminology in statistics is particularly confusing because often the same word or symbol is used to mean completely different concepts.

I know it feels that way, but it really isn’t a master plot by statisticians to keep researchers feeling ignorant.

Really.

It’s just that a lot of the methods in statistics were created by statisticians working in different fields–economics, psychology, medicine, and yes, straight statistics.  Certain fields often have specific types of data that come up a lot and that require specific statistical methodologies to analyze.

Economics needs time series, psychology needs factor analysis.  Et cetera, et cetera.

But separate fields developing statistics in isolation has some ugly effects.

Sometimes different fields develop the same technique, but use different names or notation.

Other times different fields use the same name or notation for different techniques they developed.

And of course, there are those terms with slightly different names, often used in similar contexts, but with different meanings. These are never used interchangeably, but they’re easy to confuse if you don’t use this stuff every day.

And sometimes, there are different terms for subtly different concepts, but people use them interchangeably.  (I am guilty of this myself).  It’s not a big deal if you understand those subtle differences.  But if you don’t, it’s a mess.

And it’s not just fields–it’s software, too.

SPSS uses different names for the exact same thing in different procedures.  In GLM, a continuous independent variable is called a Covariate.  In Regression, it’s called an Independent Variable.

Likewise, SAS has a Repeated statement in its GLM, Genmod, and Mixed procedures.  They all get at the same concept there (repeated measures), but they deal with it in drastically different ways.

So by the time the fields come together and realize they’re all doing the same thing, people in different fields, or using different software procedures, are already used to their own terminology.  So we’re stuck with different versions of the same word or method.

So anyway, I am beginning a series of blog posts to help clear this up.  Hopefully it will be a good reference you can come back to when you get stuck.

We’ve expanded on this list with a member training, if you’re interested.

If you have good examples, please post them in the comments.  I’ll do my best to clear things up.

 

Why Statistics Terminology is Especially Confusing

Confusing Statistical Term #1: Independent Variable

Confusing Statistical Terms #2: Alpha and Beta

Confusing Statistical Term #3: Levels

Confusing Statistical Terms #4: Hierarchical Regression vs. Hierarchical Model

Confusing Statistical Term #5: Covariate

Confusing Statistical Term #6: Factor

Same Statistical Models, Different (and Confusing) Output Terms

Confusing Statistical Term #7: GLM

Confusing Statistical Term #8: Odds

Confusing Statistical Term #9: Multiple Regression Model and Multivariate Regression Model

Confusing Statistical Term #10: Mixed and Multilevel Models

Confusing Statistical Terms #11: Confounder

Six terms that mean something different statistically and colloquially

Confusing Statistical Term #13: MAR and MCAR Missing Data

 


Sharing SPSS Output across Versions

November 18th, 2009 by

If you’ve ever tried sharing SPSS output with your collaborators, advisor, or statistical consultant, you have surely noticed that the output is often not compatible across different versions of SPSS.

And if you work in a company where everyone is working on the same site license, it’s not a problem.  But if you’re collaborating with colleagues at different universities on different upgrade schedules, you might run into some problems.

It’s true that with most software, older versions can’t read documents created in newer versions.

But SPSS’s sharing capabilities are more, um, interesting.

The syntax and data files are back and forward-compatible across many versions, at least since v9 or so.  (I don’t (more…)


Stocking the Data Analyst’s Bookshelf

November 16th, 2009 by

Many years ago, when I was teaching in a statistics department, I had my first consulting gig. Two psychology researchers didn’t know how to analyze their paired rank data. Unfortunately, I didn’t either. I asked a number of statistics colleagues (who didn’t know either), then finally borrowed a nonparametrics book. The answer was right there. (If you’re curious, it was a Friedman test.)

But the bigger lesson for me was the importance of a good reference library. No matter how much statistical training and experience you have, you won’t remember every detail about every statistical test. And you don’t need to. You just need to have access to the information and be able to understand it.

My statistics library consists of a collection of books, software manuals, articles, and web sites. Yet even in the age of Google, the heart of my library is still books. I use Google when I need to look something up, but it’s often not as quick as I’d hoped, and I don’t always find the answer. I rely on my collection of good reference books that I KNOW will have the answer I’m looking for (and continually add to it).

Not all statistics books are equally helpful in every situation. I divide books into four categories: Reference Books, Statistical Software Books, Applied Statistics Books, and Statistical Analysis Books. My library has all four, and yours should too, if data analysis is something you’ll be doing long-term. For each type, I’ve included an example you could turn to when running logistic regression in SAS, so you can compare the four types.

1. Reference Books are often text books. They are filled with formulas, theory, and exercises, as well as explanations. As a data analyst, not a student, you can skip most of it and go right for the explanations or formula you need. While I find most text books aren’t useful for learning HOW to do a new statistical method on your own, they are great references for already-familiar methods.

While I have a few favorites, the best one is often the one you already own and are familiar with, i.e. the textbooks you used in your stats classes. Hopefully, you didn’t sell back your stats text books (or worse, have the post office lose them in your cross-country move, like I did).

Example: Alan Agresti’s Categorical Data Analysis.

2. Statistical Software Books focus on using a software package. They tend to be general, often starting from the beginning, and cover everything from entering and manipulating data to advanced statistical techniques. This is the type of book to use when learning a new package or area of a package. They don’t, however, usually tell you much about the actual statistics: what it means, why to use it, or when different options make sense. And these are not manuals. They are usually written by users of the software, and are much better than manuals for learning a software program. (I think of learning a software program from its manual as like learning French from a French dictionary: not so good.)

Example: Ron Cody & Jeffrey Smith’s Applied Statistics and the SAS Programming Language

3. Applied Statistics Books are written for researchers. The focus is not on the formulas, as it is in textbooks, but on the meaning and use of the statistics. Good applied statistics books are fabulous for learning a new technique when you don’t have time for a semester-length class, but you will need a reasonably strong statistical background to read or use them well. They aren’t for beginners. The nice thing about applied statistics books is they are not tied to any piece of software, so they’re useful to anyone. That is also their limitation, though: they won’t guide you through the actual analysis in your package.

Example: Scott Menard’s Applied Logistic Regression Analysis

4. Statistical Analysis Books are a hybrid between applied statistics and statistical software books. They explain both the steps to the software AND what it all means. There aren’t many of these, but many of the ones that exist are great. The only problem is they are often published by the software companies, so each one only exists for one software package. If it’s not the one you use, they’re less useful. But they are often great anyway as Applied Statistics books.

Example: Paul Allison’s Logistic Regression using the SAS System: Theory and Application

If you are without reference books you like, buy them used. Unlike students, you don’t need the latest edition. Most areas of statistics don’t change that much. Linear regression isn’t getting new assumptions, and factor analysis isn’t getting new rotations. Unless the book covers an area of statistics that is still developing, like multilevel modeling or missing data, you’re pretty safe with a 10-year-old edition.

And it does help to buy them. Use your institution’s library to supplement your personal library. Even if it’s great, getting to that library is an extra barrier, and waiting a few weeks for the recall or interlibrary loan is sometimes too long.

I have bought used textbooks for $10. Menard’s book, and all of the excellent Sage series, are only $17, new. So it doesn’t have to cost a fortune to build a library. Even so, paying $70 for a book is sometimes completely worth it. Having the information you need will save you hours, or even days of work. How much is your time and energy worth? If you plan to do data analysis long term, invest a little each year in statistical reference books.

The full list of all four types of books Karen recommends is on The Analysis Factor Bookshelf page.

If you know of any other great books we should recommend, comment below.  I’m always looking for good books to recommend.

 


Chi-square test vs. Logistic Regression: Is a fancier test better?

November 9th, 2009 by

I recently received this email, which I thought was a great question, and one of wider interest…

Hello Karen,
I am an MPH student in biostatistics and I am curious about using regression for tests of associations in applied statistical analysis.  Why is using regression, or logistic regression “better” than doing bivariate analysis such as Chi-square?

I read a lot of studies in my graduate school studies, and it seems like half of the studies use Chi-Square to test for association between variables, and the other half, who just seem to be trying to be fancy, conduct some complicated regression-adjusted for-controlled by- model. But the end results seem to be the same. I have worked with some professionals that say simple is better, and that using Chi- Square is just fine, but I have worked with other professors that insist on building models. It also just seems so much more simple to do chi-square when you are doing primarily categorical analysis.

My professors don’t seem to be able to give me a simple justified answer, so I thought I’d ask you. I enjoy reading your site and plan to begin participating in your webinars.

Thank you!

(more…)


Have you Wondered how using SPSS Burns Calories?

October 30th, 2009 by

Maybe I’ve noticed it more because I’m getting ready for next week’s GLM in SPSS workshop. Just this week, I’ve run into a number of people struggling with SPSS, and with GLM in particular.

Number 1: I read this in a technical report by Patrick Burns comparing SPSS to R:

“SPSS is notorious for its attitude of ‘You want to do one of these things. If you don’t understand what the output means, click help and we’ll pop up five lines of mumbo-jumbo that you’re not going to understand either.’”

And while I still prefer SPSS, I had to laugh because the anonymous person Burns (more…)


3 Pieces of SPSS Syntax to Keep Handy

October 23rd, 2009 by

I hope you’re getting started using SPSS Syntax by hitting that Paste button when you use the menus.

But there are a few parts of SPSS you can’t do that with. Specifically, there are syntax commands for doing all the variable definitions that you usually fill out in the “Variable View” window. But there are no Paste buttons there, so you have to know how to write the syntax from scratch.

I find the three variable definitions that I use the most are defining Variable Labels, Value Labels and Missing Data codes. The syntax is simple and logical for all three, so I’m going to just give you the basic code, which you can keep on hand and edit as you need.

For a data set with the variables Gender, Smoke, and Exercise, with the following definitions:

Gender: 0=Male, 1=Female
Smoke: 1=Never 2=Sometimes 3=Daily
Exercise: 1=Never 2=Sometimes 3=Daily

For all three variables, 999 = a user-defined missing value

We could use the following code to give descriptive variable labels, encode the value labels, and define the missing data:

* Attach a descriptive label to each variable.
VARIABLE LABELS
GENDER 'Participant Gender'
SMOKE 'Does Participant ever Smoke Cigarettes?'
EXERCISE 'How Often Does Participant Exercise for a 30-Minute Period?'.

Notice two things:
1. I could put all three variable labels in the same VARIABLE LABELS statement.
2. There is a period at the end of the statement. This is required.

* Label the codes; SMOKE and EXERCISE share one set of labels.
VALUE LABELS
GENDER 0 'Male' 1 'Female'
/SMOKE EXERCISE
1 'Never'
2 'Sometimes'
3 'Daily'.

MISSING VALUES
GENDER SMOKE EXERCISE (999).

Since all three variables have the same missing data code, I could include them all in the same statement.
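If you have many variables to define, typing these statements by hand gets tedious. Here is a minimal sketch, in Python rather than SPSS, of one way to generate the same three commands as text that you could paste into a syntax window. The helper names (quote, variable_labels, value_labels, missing_values) are made up for this illustration and are not part of SPSS or any library; the script only builds strings and never talks to SPSS itself. It assumes single-quoted labels, with any apostrophe inside a label doubled.

```python
# Hypothetical helpers that write SPSS syntax as plain text.
# Nothing here runs SPSS; paste the printed result into a syntax window.

def quote(text):
    # Wrap a label in single quotes, doubling any apostrophe inside it.
    return "'" + text.replace("'", "''") + "'"

def variable_labels(labels):
    # labels: {variable name: descriptive label}
    lines = [f"  {name} {quote(text)}" for name, text in labels.items()]
    return "VARIABLE LABELS\n" + "\n".join(lines) + "."

def value_labels(groups):
    # groups: list of (variable names, {code: label}) pairs.
    # Variables that share a coding scheme go in the same pair.
    blocks = []
    for names, codes in groups:
        pairs = " ".join(f"{code} {quote(label)}" for code, label in codes.items())
        blocks.append(" ".join(names) + " " + pairs)
    return "VALUE LABELS\n  " + "\n  /".join(blocks) + "."

def missing_values(names, code):
    # One user-defined missing code shared by all listed variables.
    return "MISSING VALUES\n  " + " ".join(names) + f" ({code})."

frequency = {1: "Never", 2: "Sometimes", 3: "Daily"}

syntax = "\n\n".join([
    variable_labels({
        "GENDER": "Participant Gender",
        "SMOKE": "Does Participant ever Smoke Cigarettes?",
        "EXERCISE": "How Often Does Participant Exercise for a 30-Minute Period?",
    }),
    value_labels([
        (["GENDER"], {0: "Male", 1: "Female"}),
        (["SMOKE", "EXERCISE"], frequency),
    ]),
    missing_values(["GENDER", "SMOKE", "EXERCISE"], 999),
])

print(syntax)
```

The point of building the text from dictionaries is that every quote gets closed and every statement gets its final period, which are the two slips that are easiest to make when editing this kind of syntax by hand.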

There are, of course, syntax rules for all of these commands, but you can easily look them up in the Command Syntax Manual.

Want to learn more? If you’re just getting started with data analysis in SPSS, or would like a thorough refresher, please join us in our online workshop Introduction to Data Analysis in SPSS.