Some terms mean one thing in the English language, but have another (usually more specific) meaning in statistics.
In the past few months, I’ve gotten the same question from a few clients about using linear mixed models for repeated measures data. They want to take advantage of its ability to give unbiased results in the presence of missing data. In each case the study has two groups complete a pre-test and a post-test measure. Both of these have a lot of missing data.
The research question is whether the groups have different improvements in the dependent variable from pre to post test.
As a typical example, say you have a study with 160 participants.
90 of them completed both the pre and the post test.
Another 48 completed only the pretest and 22 completed only the post-test.
Repeated Measures ANOVA will deal with the missing data through listwise deletion. That means keeping only the 90 people with complete data. This causes problems with both power and bias, but bias is the bigger issue.
Another alternative is to use a Linear Mixed Model, which will use the full data set. This is an advantage, but it’s not as big of an advantage in this design as in other studies.
The mixed model will retain the 70 people who have data for only one time point. It will use the 48 people with pretest-only data along with the 90 people with full data to estimate the pretest mean.
Likewise, it will use the 22 people with posttest-only data along with the 90 people with full data to estimate the post-test mean.
If the data are missing at random, this will give you unbiased estimates of each of these means.
But most of the time in Pre-Post studies, the interest is in the change from pre to post across groups.
The difference in means from pre to post will be calculated based on the estimates at each time point. But the degrees of freedom for the difference will be based only on the number of subjects who have data at both time points.
So with only two time points, if the people with data at only one time point are no different from those with full data (creating no bias), you're not gaining much by keeping those 70 people in the analysis.
Compare this to another study I saw in consulting, with 5 time points. Nearly all the participants had 4 of the 5 observations. The missing data were pretty random: some participants missed time 1, others time 4, and so on. Only 6 people out of 150 had full data, so listwise deletion was a nightmare, leaving just those 6 in the data set.
Each person contributed data to 4 means, so each mean had a pretty reasonable sample size. Since the missingness was random, each mean was unbiased. Each subject fully contributed data and df to many of the mean comparisons.
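A stylized version of that design shows the arithmetic. In this sketch every subject is missing exactly one time point (so, unlike the real study, no one at all is complete), yet every time-point mean still rests on a healthy sample:

```python
import numpy as np
import pandas as pd

# Stylized 5-time-point design: 150 subjects in long format,
# each missing exactly one time point.
n_subjects, n_times = 150, 5
df = pd.DataFrame({
    "id": np.repeat(np.arange(n_subjects), n_times),
    "time": np.tile(np.arange(n_times), n_subjects),
    "y": 1.0,
})
# Subject i misses time i % 5, spreading missingness evenly.
df = df[df["time"] != df["id"] % n_times]

# Available-case counts: each time-point mean rests on 120 subjects.
print(df.groupby("time")["id"].count())

# Listwise deletion keeps only subjects with all 5 observations.
complete = df.groupby("id")["time"].count().eq(n_times).sum()
print(complete)  # 0 complete cases in this stylized version
```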
With more than 2 time points and data that are missing at random, each subject can contribute to some change measurements. Keep that in mind the next time you design a study.
You may have never heard of listwise deletion for missing data, but you’ve probably used it.
Listwise deletion means that any individual in a data set is deleted from an analysis if they’re missing data on any variable in the analysis.
It’s the default in most software packages.
Although the simplicity of it is a major advantage, it causes big problems in many missing data situations.
But not always. If you happen to have one of the uncommon missing data situations in which listwise deletion works well, it's a perfectly reasonable choice.
Q: Do most high impact journals require authors to state which method has been used on missing data?
I don’t usually get far enough in the publishing process to read journal requirements.
But based on my conversations with researchers who both review articles for journals and who deal with reviewers’ comments, I can offer this response.
I would be shocked if journal editors at top journals didn't want information about the missing data technique. If you leave it out, they'll assume either that you didn't have missing data or that you used defaults like listwise deletion.
Do you find quizzes irresistible? I do.
Here’s a little quiz about working with missing data:
True or False?
1. Imputation is really just making up data to artificially inflate results. It’s better to just drop cases with missing data than to impute.
2. I can just impute the mean for any missing data. It won’t affect results, and improves power.
3. Multiple Imputation is fine for the predictor variables in a statistical model, but not for the response variable.
4. Multiple Imputation is always the best way to deal with missing data.
5. When imputing, it’s important that the imputations be plausible data points.
6. Missing data isn’t really a problem if I’m just doing simple statistics, like chi-squares and t-tests.
7. The worst thing that missing data does is lower sample size and reduce power.
The default approach to dealing with missing data in most statistical software packages is listwise deletion–dropping any case with data missing on any variable involved anywhere in the analysis. It also goes under the names case deletion and complete case analysis.
Although this approach can be really painful (you worked hard to collect those data, only to drop them!), it does work well in some situations. By "works well," I mean it meets 3 criteria:
– gives unbiased parameter estimates
– gives accurate (or at least conservative) standard error estimates
– results in adequate power.
But not always. So over the years, a number of ad hoc approaches have been proposed to stop the bloodletting of so much data. Although each solves some of listwise deletion's problems, it creates others. The three below have been discredited in recent years and should NOT be used:
Pairwise Deletion: use the available data for each part of an analysis. This has been shown to result in correlations outside the −1 to +1 range and other fun statistical impossibilities.
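You can see one of those impossibilities with a small contrived pandas example. Each pair of variables is observed on a different set of cases, and although every pairwise correlation is legal on its own, no complete data set could have produced all three at once: the resulting correlation matrix is not positive semidefinite.

```python
import numpy as np
import pandas as pd

# Three variables; each pair is observed on a different set of cases.
df = pd.DataFrame({
    "x": [1, 2, 3, np.nan, np.nan, np.nan, 1, 2, 3],
    "y": [1, 2, 3, 1, 2, 3, np.nan, np.nan, np.nan],
    "z": [np.nan, np.nan, np.nan, 1, 2, 3, 3, 2, 1],
})

# pandas computes pairwise-complete correlations by default.
r = df.corr()
print(r.round(1))  # r_xy = 1, r_yz = 1, r_xz = -1

# No real correlation matrix can look like this: its smallest
# eigenvalue is negative, i.e. it is not positive semidefinite.
print(np.linalg.eigvalsh(r).min())
```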
Mean Imputation: substitute the mean of the observed values for all missing data. There are so many problems, it’s difficult to list them all, but suffice it to say, this technique never meets the above 3 criteria.
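A tiny sketch shows the most basic of those problems: imputed values sit exactly at the center of the distribution, so the variance is artificially deflated and any standard errors computed from the filled-in data are too small.

```python
import numpy as np
import pandas as pd

# Six cases, two missing values on x.
x = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan, np.nan])

# Mean imputation: every missing value becomes the observed mean (2.5).
x_imputed = x.fillna(x.mean())

print(x.std())          # spread of the observed values
print(x_imputed.std())  # smaller: variance is artificially deflated
```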
Dummy Variable: create a dummy variable that indicates whether a data point is missing, then substitute any arbitrary value for the missing data in the original variable. Use both variables in the analysis. While it does mitigate the loss of power, it usually leads to biased results.
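The construction itself is simple, which is part of its appeal. A minimal pandas sketch (the filled-in value of 0 is arbitrary by design):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, 3.0, np.nan],
                   "y": [2.0, 4.0, 6.0, 8.0]})

# Flag missingness, then plug in an arbitrary constant (here 0)
# so that no case gets dropped.
df["x_missing"] = df["x"].isna().astype(int)
df["x_filled"] = df["x"].fillna(0.0)

# Both x_filled and x_missing would then enter the model together.
print(df)
```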
There are a number of good techniques for dealing with missing data, some of which are not hard to use, and which are now available in all major stat software. There is no reason to continue to use ad hoc techniques that create more problems than they solve.