
Multiple Imputation: 5 Recent Findings that Change How to Use It

March 24th, 2010 by

Missing data, and multiple imputation specifically, is one area of statistics that is changing rapidly. Research is ongoing, and each year brings new findings on best practices and new techniques in software.

The downside for researchers is that some of the recommendations missing data statisticians were making even five years ago have changed.

Remember that there are three goals of multiple imputation, or any missing data technique: Unbiased parameter estimates in the final analysis (more…)


Mediators, Moderators, and Suppressors: What IS the difference?

March 10th, 2010 by

One of the biggest questions I get is about the difference between mediators and moderators, and how they both differ from control variables.

I recently found a fabulous free video tutorial on the difference between mediators, moderators, and suppressor variables, by Jeremy Taylor at Stats Make Me Cry.   The witty example is about the different types of variables–talent, practice, etc.–that explain the relationship between having a guitar and making lots of $$.

 


Another Great SPSS book: SPSS Programming and Data Management

March 3rd, 2010 by

Have you ever needed to do some major data management in SPSS and ended up with a syntax program that’s pages long?  This is the kind of job you couldn’t even do with the menus: you’d tear your hair out with frustration, because it would take you four weeks to create some new variables.

I hope you’ve gotten started using Syntax, which gives you a record of how you’ve recoded and created all those new variables, and of exactly which options you chose in the data analyses you’ve run.

But once you get started, you start to realize that some things feel a little clunky.  You have to run the same descriptive analysis on 47 different variables.  And while cutting and pasting is a heck of a lot easier than doing that in the menus, you wonder if there isn’t a better way.

There is.

SPSS syntax actually has a number of ways to increase programming efficiency, including macros, do loops, and repeats.

I admit I haven’t used this stuff a lot, but I’m increasingly seeing just how useful it can be.  I’m much better trained in doing these kinds of things in SAS, so I admit I have been known to just import data into SAS to run manipulations.

But I just came across a great resource on doing sophisticated SPSS Syntax Programming, and it looks like some fabulous bedtime reading.  (Seriously.)

And the best part is you can download it (or order it, if you’d like a copy to take to bed) from the author’s website, Raynald’s SPSS Tools, itself a great source of info on mastering SPSS.

So once you’ve gotten into the habit of hitting Paste instead of Okay, and gotten a bit used to SPSS syntax, and you’re ready to step your skills up a notch, this looks like a fabulous book.

[Edit]: As per Jon Peck in the comments below, the most recent version is now available at www.ibm.com/developerworks/spssdevcentral under Books and Articles.

Want to learn more? If you’re just getting started with data analysis in SPSS, or would like a thorough refresher, please join us in our online workshop Introduction to Data Analysis in SPSS.

 


A Few Resources on Zero-Inflated Poisson Models

February 15th, 2010 by

1. For a general overview of modeling count variables, you can get free access to the video recording of one of my webinars in The Craft of Statistical Analysis series:

Poisson and Negative Binomial for Count Outcomes

2. One of my favorite books on Categorical Data Analysis is:

Long, J. Scott. (1997).  Regression Models for Categorical and Limited Dependent Variables.  Sage Publications.

It’s moderately technical, but written with social science researchers in mind.  It’s so well written, it’s worth it.  It has a section specifically about Zero-Inflated Poisson and Zero-Inflated Negative Binomial regression models.

3. Slightly less technical, but most useful if you use Stata, is Regression Models for Categorical Dependent Variables Using Stata, by J. Scott Long and Jeremy Freese.

4. UCLA’s ATS Statistical Software Consulting Group has some nice examples of Zero-Inflated Poisson and other models in various software packages.

 


Zero-Inflated Poisson Models for Count Outcomes

February 12th, 2010 by

There are quite a few types of outcome variables that will never meet the ordinary linear model’s assumption of normally distributed residuals.  A non-normal outcome variable can have normally distributed residuals, but it does need to be continuous, unbounded, and measured on an interval or ratio scale.  Categorical outcome variables clearly don’t fit this requirement, so it’s easy to see that an ordinary linear model is not appropriate.  Neither do count variables, though it’s less obvious.  They are measured on a ratio scale, so it’s easy to think of them as continuous, or close to it.  But they’re neither continuous nor unbounded, and that really affects the assumptions.

Continuous variables measure how much.  Count variables measure how many.  Count variables can’t be negative: 0 is the lowest possible value.  They’re often skewed, so severely that 0 is by far the most common value.  And they’re discrete, not continuous.  All those jokes about the average family having 1.3 children have a ring of truth in this context.

Count variables often follow a Poisson or one of its related distributions.  The Poisson distribution assumes that each count is the result of the same Poisson process—a random process that says each counted event is independent and equally likely.  If this count variable is used as the outcome of a regression model, we can use Poisson regression to estimate how predictors affect the number of times the event occurred.
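Here is a minimal sketch of what a Poisson regression might look like in Python’s statsmodels. The package choice and the simulated data are illustrative assumptions only:

import numpy as np
import statsmodels.api as sm

# Simulated (hypothetical) data: one predictor and a count outcome
rng = np.random.default_rng(42)
n = 500
x = rng.normal(size=n)
y = rng.poisson(np.exp(0.3 + 0.5 * x))   # counts whose log-mean depends on x

# Poisson regression: model the log of the expected count
X = sm.add_constant(x)
poisson_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(poisson_fit.summary())
print(np.exp(poisson_fit.params))        # exponentiated coefficients are rate ratios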

But the Poisson model has very strict assumptions.  One that is often violated is that the mean equals the variance.  When the variance is too large because there are many 0s as well as a few very high values, the negative binomial model is an extension that can handle the extra variance.
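If overdispersion is the issue, the negative binomial fit is nearly a one-line swap. Again, a minimal sketch assuming statsmodels and made-up data:

import numpy as np
import statsmodels.api as sm

# Simulated (hypothetical) overdispersed counts: Poisson rate with extra gamma noise
rng = np.random.default_rng(7)
n = 500
x = rng.normal(size=n)
y = rng.poisson(np.exp(0.3 + 0.5 * x) * rng.gamma(shape=1.0, scale=1.0, size=n))

X = sm.add_constant(x)
nb_fit = sm.NegativeBinomial(y, X).fit()   # alpha in the output is the estimated dispersion
print(nb_fit.summary())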

But sometimes it’s just a matter of having more zeros than a Poisson would predict.  In this case, a better solution is often the Zero-Inflated Poisson (ZIP) model.  (And when extra variation occurs too, its close relative is the Zero-Inflated Negative Binomial model.)

ZIP models assume that some zeros occurred by a Poisson process, but others were not even eligible to have the event occur.  So there are two processes at work—one that determines if the individual is even eligible for a non-zero response, and the other that determines the count of that response for eligible individuals.

The tricky part is that either process can result in a 0 count.   Since you can’t tell which individuals were eligible for a non-zero count, you can’t tell which zeros resulted from which process.  The ZIP model fits, simultaneously, two separate regression models.  One is a logistic or probit model that models the probability of being eligible for a non-zero count.  The other models the size of that count.

Both models use the same predictor variables, but estimate their coefficients separately.  So the predictors can have vastly different effects on the two processes.

But a ZIP model requires it be theoretically plausible that some individuals are ineligible for a count.  For example, consider a count of the number of disciplinary incidents in a day in a youth detention center.  True, there may be some youth who would never instigate an incident, but the unit of observation in this case is the center.  It is hard to imagine a situation in which a detention center would have no possibility of any incidents, even if they didn’t occur on some days.

Compare that to the number of alcoholic drinks consumed in a day, which could plausibly be fit with a ZIP model.  Some participants do drink alcohol, but will have consumed 0 that day, by chance.   But others just do not drink alcohol, so will never have a non-zero response.  The ZIP model can determine which predictors affect the probability of being an alcohol consumer and which predictors affect how many drinks the consumers consume.  They may not be the same predictors for the two models, or they could even have opposite effects on the two processes.
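To make the two processes concrete, here is a minimal sketch of a ZIP model fit in Python’s statsmodels on simulated drinks-per-day data. The library, the predictor name, and the data are illustrative assumptions. One caveat: statsmodels parameterizes the inflation (logit) part as the probability of a structural zero, that is, of being a non-drinker, so the signs run opposite to the “probability of being eligible” wording above.

import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedPoisson

# Simulated (hypothetical) data: drinks consumed in a day
rng = np.random.default_rng(0)
n = 1000
stress = rng.normal(size=n)                                  # a made-up predictor

# Process 1: is this person a drinker at all? (logit)
drinker = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 0.8 * stress))))

# Process 2: for drinkers, how many drinks? (Poisson)
drinks = drinker * rng.poisson(np.exp(0.2 + 0.4 * stress))   # non-drinkers are structural zeros

# Same predictors in both parts, with coefficients estimated separately
X = sm.add_constant(stress)
zip_fit = ZeroInflatedPoisson(drinks, X, exog_infl=X, inflation='logit').fit(method='bfgs', maxiter=500)
print(zip_fit.summary())   # rows prefixed 'inflate_' are the zero-inflation (logit) part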

 


What Makes a Statistical Analysis Wrong?

January 21st, 2010 by

One of the most anxiety-laden questions I get from researchers is whether their analysis is “right.”

I’m always slightly uncomfortable with that word. Often there is no one right analysis.

It’s like finding Mr. or Ms. Right. Most of the time, there is not just one Right. But there are many that are clearly Wrong.

What Makes an Analysis Right?

Luckily, what makes an analysis right is easier to define than what makes a person right for you. It pretty much comes down to two things: whether the assumptions of the statistical method are being met and whether the analysis answers the research question.

Assumptions are very important. A test needs to reflect the measurement scale of the variables, the study design, and issues in the data. A repeated measures study design requires a repeated measures analysis. A binary dependent variable requires a categorical analysis method.

But within those general categories, there are often many analyses that meet assumptions. A logistic regression or a chi-square test both handle a binary dependent variable with a single categorical predictor. But a logistic regression can answer more research questions. It can incorporate covariates, directly test interactions, and calculate predicted probabilities. A chi-square test can do none of these.
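To see that difference concretely, here is a sketch in Python (scipy and statsmodels, with made-up data and hypothetical variable names). The chi-square test only assesses the association in the 2x2 table; the logistic regression answers the same question while adding a covariate and producing predicted probabilities:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2_contingency

# Hypothetical simulated data: a binary outcome, a binary predictor, and a covariate
rng = np.random.default_rng(1)
n = 300
group = rng.integers(0, 2, size=n)
age = rng.normal(40, 10, size=n)
outcome = rng.binomial(1, 1 / (1 + np.exp(-(-2 + 1.0 * group + 0.03 * age))))
df = pd.DataFrame({"outcome": outcome, "group": group, "age": age})

# Chi-square test: association between group and outcome, nothing more
chi2, pval, dof, expected = chi2_contingency(pd.crosstab(df["group"], df["outcome"]))
print(chi2, pval)

# Logistic regression: same question, plus a covariate and predicted probabilities
# (swapping in "outcome ~ group * age" would also test the interaction directly)
logit_fit = smf.logit("outcome ~ group + age", data=df).fit()
print(logit_fit.summary())
print(logit_fit.predict(df.head()))   # predicted probabilities for the first few cases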

So you get different information from different tests. They answer different research questions.

An analysis that is correct from an assumptions point of view is useless if it doesn’t answer the research question. A data set can spawn an endless number of statistical tests that don’t answer the research question. And you can spend an endless number of days running them.

When to Think about the Analysis

The real bummer is that it’s not always clear the analyses aren’t relevant until you write up the research paper.

That’s why writing out the research questions in theoretical and operational terms is the first step of any statistical analysis. It’s absolutely fundamental. And I mean writing them in minute detail. Issues of mediation, interaction, subsetting, control variables, et cetera, should all be blatantly obvious in the research questions.

Thinking about how to analyze the data before collecting the data can keep you from hitting a dead end. It can be very obvious, once you think through the details, that the analysis available to you based on the data won’t answer the research question.

Whether the answer is what you expected or not is a different issue.

So when you are concerned about getting an analysis “right,” clearly define the design, variables, and data issues, but most importantly, get explicitly clear about what you want to learn from this analysis.

Once you’ve done this, it’s much easier to find the statistical method that answers the research questions and meets assumptions. Even if you don’t know the right method, you can narrow your search with clear guidance.