Data Cleaning is a critically important part of any data analysis. Without properly prepared data, the analysis will yield inaccurate results. Correcting errors later in the analysis adds to the time, effort, and cost of the project.

# Missing Data

## Member Training: Multiple Imputation for Missing Data

There are a number of simplistic methods available for tackling the problem of missing data. Unfortunately, each of these simplistic methods is very likely to introduce bias into our model results.

Multiple imputation is considered to be the superior method of working with missing data. It eliminates the bias introduced by the simplistic methods in many missing data situations.


## Six Differences Between Repeated Measures ANOVA and Linear Mixed Models

As mixed models are becoming more widespread, there is a lot of confusion about when to use these more flexible but complicated models and when to use the much simpler and easier-to-understand repeated measures ANOVA.

One thing that makes the decision harder is that sometimes the results are *exactly the same* from the two models and sometimes the results are …

## Linear Mixed Models for Missing Data in Pre-Post Studies

In the past few months, I’ve gotten the same question from a few clients about using linear mixed models for repeated measures data. They want to take advantage of the models’ ability to give unbiased results in the presence of missing data. In each case the study has two groups complete a pre-test and a post-test measure. Both of these have a lot of missing data.

The research question is whether the groups have different improvements in the dependent variable from pre to post test.

As a typical example, say you have a study with 160 participants.

90 of them completed both the pre and the post test.

Another 48 completed only the pretest and 22 completed only the post-test.

Repeated Measures ANOVA will deal with the missing data through listwise deletion. That means keeping only the 90 people with complete data. This causes problems with both power and bias, but bias is the bigger issue.

An alternative is to use a Linear Mixed Model, which will use the full data set. This is an advantage, but it’s not as big an advantage in this design as in other studies.

The mixed model *will* retain the 70 people who have data for only one time point. It will use the 48 people with pretest-only data along with the 90 people with full data to estimate the pretest mean.

Likewise, it will use the 22 people with posttest-only data along with the 90 people with full data to estimate the post-test mean.

If the data are missing at random, this will give you unbiased estimates of each of these means.
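The sample sizes behind each estimate in this example can be tallied with a short sketch. It only reconstructs the counts given above (90 complete, 48 pretest only, 22 post-test only). The variable names are made up for illustration, and no actual scores or model fitting are involved.

```python
# Hypothetical reconstruction of the example's missing-data pattern:
# True marks an observed score, False a missing one.
participants = (
    [{"pre": True, "post": True}] * 90      # completed both tests
    + [{"pre": True, "post": False}] * 48   # pretest only
    + [{"pre": False, "post": True}] * 22   # post-test only
)

n_total = len(participants)
n_pre = sum(p["pre"] for p in participants)    # people informing the pretest mean
n_post = sum(p["post"] for p in participants)  # people informing the post-test mean
# Listwise deletion (as in repeated measures ANOVA) keeps only complete cases.
n_complete = sum(p["pre"] and p["post"] for p in participants)
n_one_time_point = n_total - n_complete

print(n_total, n_pre, n_post, n_complete, n_one_time_point)
# prints: 160 138 112 90 70
```

So the mixed model estimates the pretest mean from 138 people and the post-test mean from 112, while listwise deletion would use the same 90 people for both.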

But most of the time in Pre-Post studies, the interest is in the change from pre to post across groups.

The difference in means from pre to post will be calculated based on the estimates at each time point. But the degrees of freedom for the difference will be based only on the number of subjects who have data *at both* time points.

So with only two time points, if the people with one time point are no different from those with full data (creating no bias), you’re *not gaining anything* by keeping those 70 people in the analysis.

Compare this to a study I also saw in consulting with 5 time points. Nearly all the participants had 4 out of the 5 observations. The missing data was pretty random: some participants missed time 1, others time 4, etc. Only 6 people out of 150 had full data. Listwise deletion created a nightmare, leaving only 6 people in the data set.

Each person contributed data to 4 means, so each mean had a pretty reasonable sample size. Since the missingness was random, each mean was unbiased. Each subject fully contributed data and df to many of the mean comparisons.
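To see roughly how much data each mean and each comparison keeps in a design like this, here is a minimal sketch. It rests on a simplifying assumption that is not from the actual study: the 6 complete cases come first, and each of the other 144 participants misses exactly one time point, rotated evenly across the five occasions as a stand-in for random missingness. All names are hypothetical.

```python
from itertools import combinations

N_PARTICIPANTS = 150
N_TIMES = 5
N_COMPLETE = 6

# Assumed missingness pattern (illustrative only): participants after the
# first 6 each miss exactly one time point, assigned in rotation.
observed = []
for i in range(N_PARTICIPANTS):
    row = [True] * N_TIMES
    if i >= N_COMPLETE:
        row[i % N_TIMES] = False
    observed.append(row)

# Listwise deletion keeps only people observed at every time point.
n_listwise = sum(all(row) for row in observed)

# Number of people contributing to the mean at each time point.
n_per_time = [sum(row[t] for row in observed) for t in range(N_TIMES)]

# Number of people observed at both time points of each pairwise comparison.
n_per_pair = {
    (a + 1, b + 1): sum(row[a] and row[b] for row in observed)
    for a, b in combinations(range(N_TIMES), 2)
}

print(n_listwise)                 # 6
print(n_per_time)                 # [122, 121, 121, 121, 121]
print(min(n_per_pair.values()))   # 92
```

Under this assumed pattern, every mean rests on more than 120 people and every pairwise change on at least 92, compared with 6 people under listwise deletion.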

With more than 2 time points and data that are missing at random, each subject can contribute to some change measurements. Keep that in mind the next time you design a study.

## Linear Regression in Stata: Missing Data and the Stories it Might Tell

*by Jeff Meyer*

In a previous post, Using the Same Sample for Different Models in Stata, we examined how to use the same sample when comparing regression models. Using different samples in our models could lead to erroneous conclusions when interpreting results.

But excluding observations can also result in inaccurate results.

The coefficient for the variable “frequent religious attendance” was negative 58 in model 3 …

## Multiple Imputation for Missing Data: Indicator Variables versus Categorical Variables

A data set can contain indicator (dummy) variables, categorical variables, or both. Which type a variable is depends initially on how the data are coded.

For example, a categorical variable like marital status could be coded in the data set as a single variable with 5 values: …