Statwise: August 2015

Volume 8, Issue 1	August 2015

A Note From Karen

I'm happy to say that I was able to meet a number of you in person during a recent conference-related trip to Cleveland then Seattle. As much as I love this virtual business, it's always nice to actually have in-person conversations.

For many of us, it's time to start the ramp up for fall.

Before I do, I'm going to have one more week out of the office for a family camping trip, but will be back on September 8th. So if you need consulting, I'll be back soon.

While I'm gone, David Lillis' Generalized Linear Models in R workshop will begin. These models are incredibly flexible and include many of the models you use often, like logistic and count regression models.

They also include a few models you may not have heard of, including one I describe in this month's article. I hope you find it helpful.

Happy analyzing!
Karen

Feature Article: Zero One Inflated Beta Models for Proportion Data

Proportion and percentage data are tricky to analyze.

Much like count data, they look like they should work in a linear model.

They're numerical. They're often continuous.

And sometimes they do work. Some proportion data do look normally distributed so estimates and p-values are reasonable.

But more often they don't. So estimates and p-values are a mess. Luckily, there are other options. One is beta regression.

Beta Regression

Like logistic and Poisson regression, beta regression is a type of generalized linear model.

It works nicely for proportion data because the values of a variable with a beta distribution must fall between 0 and 1.

It's a bit of a funky distribution in that its shape can change a lot depending on the values of the mean and dispersion parameters.

Here are a few examples of the possible shapes of a beta distribution, with different means and variances:

You can see that in some, the distribution looks quite normal. In that situation, you would get reasonable estimates and p-values if you assumed normality.
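To see how much the shape can change, here is a minimal Python sketch that evaluates the beta density on a grid. It assumes the mean/precision parameterization often used for beta regression (shape parameters a = mean × precision and b = (1 − mean) × precision), where higher precision means lower dispersion. The function names and the three parameter pairs are illustrative choices, not values from the graphs.

```python
import math

def beta_pdf(y, mean, precision):
    """Beta density at y (0 < y < 1), given a mean and a precision.

    Assumed parameterization: a = mean * precision, b = (1 - mean) * precision.
    """
    a = mean * precision
    b = (1.0 - mean) * precision
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y))

def beta_variance(mean, precision):
    """Variance implied by the same mean/precision parameterization."""
    return mean * (1.0 - mean) / (1.0 + precision)

grid = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

# Illustrative settings: roughly bell-shaped, strongly skewed, and U-shaped.
for mean, precision in [(0.5, 20.0), (0.9, 10.0), (0.5, 1.0)]:
    heights = [beta_pdf(y, mean, precision) for y in grid]
    peak = grid[heights.index(max(heights))]
    print(f"mean={mean}, precision={precision}: "
          f"variance={beta_variance(mean, precision):.4f}, highest density near y={peak}")
```

The first setting is symmetric around .5, the second piles up near 1, and the third puts most of its density near both ends.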

But here is just the kind of sticky situation you commonly see in real data. Let's say you want to compare the mean proportion of days out of 30 that people do some behavior--take their prescribed medication, exercise for at least 30 minutes, or act physically aggressively toward peers.

Maybe you've got an intervention, and you want to test whether it helps people take their medications. Perhaps the control group indeed looks like the nice normal distribution in the third graph above.

But the treatment worked so well that in the intervention group, the distribution is highly skewed. It looks like the last graph.

Assuming normality isn't going to work here. That's where a beta regression can work instead.

One big problem.

0 and 1 aren't possible values in a beta distribution. So if Y|X follows a beta distribution, Y can have values close to 0 and 1, say .001 or .998. But not 0 or 1 exactly.

So if a client takes their medication 30 out of 30 days, a beta regression won't run. You can't have any 0s or 1s in the data set.
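Here is a small Python sketch of why exact 0s and 1s are a problem. The beta log-likelihood contains log(y) and log(1 − y), and one of those is log(0) at a boundary value. The day counts below are invented for illustration, the mean and precision are arbitrary, and the mean/precision parameterization is an assumption. Real software reports the problem in its own way.

```python
import math

def beta_loglik(props, mean, precision):
    """Sum of beta log densities; every value must be strictly between 0 and 1."""
    a = mean * precision
    b = (1.0 - mean) * precision
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    total = 0.0
    for y in props:
        if y <= 0.0 or y >= 1.0:
            # log(y) or log(1 - y) would be log(0), so stop with a clear message
            raise ValueError(f"beta likelihood is undefined at y = {y}")
        total += log_norm + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y)
    return total

days_taken = [30, 28, 25, 30, 12, 0]      # made-up counts out of 30 days
props = [d / 30 for d in days_taken]

interior = [y for y in props if 0.0 < y < 1.0]
print("log-likelihood, interior values only:", beta_loglik(interior, 0.7, 5.0))

try:
    beta_loglik(props, 0.7, 5.0)
except ValueError as err:
    print("cannot evaluate the full data set:", err)
```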

Zero-One Inflated Beta Models

There is, however, a version of the beta regression model that can work in this situation. It's one of those models that has been around in theory for a while, but has only become available in (some) mainstream statistical software in the past few years.

It's called a Zero-One-Inflated Beta and it works very much like a Zero-Inflated Poisson model.

It's a type of mixture model that says there are really three processes going on.

One is a process that distinguishes between zeros and non-zeros. The idea is there is something qualitatively different about people who never take their medication compared with those who do, at least sometimes.

Likewise, there is a process that distinguishes between ones and non-ones. Again, there is something qualitatively different about people who always take their medication compared with those who do sometimes or never.

And then there is a third process that determines how much someone takes their medication if they do some of the time.

The first and second processes are each run through a logistic regression and the third through a beta regression.
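To make the three processes concrete, here is a minimal Python simulation of one observation from such a mixture. It assumes a sequential setup: first zero versus non-zero, then one versus non-one among the non-zeros, then a beta draw. Software packages may parameterize the two inflation parts differently, and the probabilities, mean, and precision below are made up.

```python
import random

def draw_zoib(rng, p_zero, p_one, mean, precision):
    """Draw one proportion from a zero-one-inflated beta mixture (sequential form)."""
    if rng.random() < p_zero:   # process 1: zero vs. non-zero
        return 0.0
    if rng.random() < p_one:    # process 2: one vs. non-one, among the non-zeros
        return 1.0
    # process 3: a value strictly inside (0, 1) from a beta distribution,
    # with shapes a = mean * precision and b = (1 - mean) * precision
    return rng.betavariate(mean * precision, (1.0 - mean) * precision)

rng = random.Random(2015)  # fixed seed so the sketch is repeatable
sample = [draw_zoib(rng, 0.10, 0.30, 0.7, 5.0) for _ in range(10_000)]

n = len(sample)
print("share of exact zeros:", sum(y == 0.0 for y in sample) / n)
print("share of exact ones: ", sum(y == 1.0 for y in sample) / n)
print("share in between:    ", sum(0.0 < y < 1.0 for y in sample) / n)
```

In this setup the overall share of ones is (1 − p_zero) × p_one, because the second process only applies to people who are not zeros.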

These three models are run simultaneously. They can each have their own set of predictors and their own set of coefficients. For example, maybe memory is a big predictor of how often someone takes their medication if they take it sometimes, but not at all an issue for whether or not someone takes it 0 times. Perhaps those people aren't forgetting--they can't afford to purchase it.

So maybe whether someone has health insurance that pays for the medication is a predictor in the zero/non-zero logistic regression, but not in the other two parts.
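Here is a rough Python sketch of how each part can have its own predictors. It writes the log-likelihood contribution of one person under an assumed sequential parameterization with logit links. The coefficients, the `insured` indicator, and the `memory` score are hypothetical values chosen only to mirror the example above. They are not estimates from any data or the output of any package.

```python
import math

def inv_logit(x):
    """Map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def zoib_loglik(y, eta_zero, eta_one, eta_mean, precision):
    """Log-likelihood of one observation y in [0, 1].

    eta_zero, eta_one, eta_mean are the linear predictors of the three parts.
    """
    p_zero = inv_logit(eta_zero)   # P(y = 0)
    p_one = inv_logit(eta_one)     # P(y = 1 | y != 0)
    if y == 0.0:
        return math.log(p_zero)
    if y == 1.0:
        return math.log(1.0 - p_zero) + math.log(p_one)
    mean = inv_logit(eta_mean)     # beta mean for values strictly inside (0, 1)
    a, b = mean * precision, (1.0 - mean) * precision
    log_beta = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y))
    return math.log(1.0 - p_zero) + math.log(1.0 - p_one) + log_beta

# Hypothetical person: insured, with a memory score of 0.5.
insured, memory = 1, 0.5
eta_zero = -1.0 - 1.5 * insured   # insurance appears only in the zero part
eta_one = 0.2                     # intercept-only one part
eta_mean = 0.3 + 0.8 * memory     # memory appears only in the beta part

for y in (0.0, 0.8, 1.0):
    print(y, zoib_loglik(y, eta_zero, eta_one, eta_mean, precision=5.0))
```

Fitting the model means choosing the coefficients in all three linear predictors, plus the precision, to maximize the sum of these contributions over everyone in the data set.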

Depending on the shape of the distribution, you may not need all three processes. If there are no zeros in the data set, you may only need to accommodate inflation at 1.
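A quick first check is simply to count the exact 0s and 1s. Here is a minimal Python sketch that does this. The function names and the example proportions are made up, and a count like this is only a starting point, not a formal way to choose a model.

```python
def boundary_summary(props):
    """Count exact zeros, exact ones, and values strictly in between."""
    zeros = sum(1 for y in props if y == 0.0)
    ones = sum(1 for y in props if y == 1.0)
    return {"zeros": zeros, "ones": ones, "interior": len(props) - zeros - ones}

def needed_parts(props):
    """List the model parts the data appear to call for."""
    counts = boundary_summary(props)
    parts = []
    if counts["interior"] > 0:
        parts.append("beta")
    if counts["zeros"] > 0:
        parts.append("zero-inflation")
    if counts["ones"] > 0:
        parts.append("one-inflation")
    return parts

# Made-up proportions with some exact 1s but no exact 0s.
example = [1.0, 0.93, 0.83, 1.0, 0.4, 0.97]
print(boundary_summary(example))
print(needed_parts(example))
```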

It's highly flexible and adds important options to your data analysis toolbox.

References and Further Reading:

ZOIB: Stata module to fit a zero-one inflated beta distribution by maximum likelihood

zoib: R package for Bayesian Inference of zero-one inflated beta regression

This Month's Data Analysis Brown Bag Webinar

Latent Class Analysis

Upcoming Workshops:

Generalized Linear Models in R

Analyzing Repeated Measures Data: ANOVA and Mixed Model Approaches

You received this email because you subscribed to The Analysis Factor's list community. To change your subscription, see the link at end of this email.

Please forward this to anyone you know who might benefit. If you received this from a friend, sign up for this email newsletter here.

About Us

What is The Analysis Factor? The Analysis Factor is the difference between knowing about statistics and knowing how to use statistics in data analysis. It acknowledges that statistical analysis is an applied skill. It requires learning how to use statistical tools within the context of a researcher's own data, and supports that learning.

The Analysis Factor, the organization, offers statistical consulting, resources, and learning programs that empower researchers to become confident, able, and skilled statistical practitioners. Our aim is to make your journey acquiring the applied skills of statistical analysis easier and more pleasant.

Karen Grace-Martin, the founder, spent seven years as a statistical consultant at Cornell University. While there, she learned that being a great statistical advisor is not only about having excellent statistical skills, but about understanding the pressures and issues researchers face, about fabulous customer service, and about communicating technical ideas at a level each client understands.

You can learn more about Karen Grace-Martin and The Analysis Factor at theanalysisfactor.com.
