# How to Diagnose the Missing Data Mechanism


One important consideration in choosing a missing data approach is the missing data mechanism, because different approaches make different assumptions about it.

Each of the three mechanisms describes one possible relationship between the propensity of data to be missing and values of the data, both missing and observed.

## The Missing Data Mechanisms

Missing Completely at Random, MCAR, means there is no relationship between the missingness of the data and any values, observed or missing. Those missing data points are a random subset of the data. There is nothing systematic going on that makes some data more likely to be missing than others.

Missing at Random, MAR, means there is a systematic relationship between the propensity of missing values and the observed data, but not the missing data.

Whether an observation is missing has nothing to do with the missing values, but it does have to do with the values of an individual’s observed variables. So, for example, if men are more likely to tell you their weight than women, and within each sex the chance of answering does not depend on the weight itself, weight is MAR.

Missing Not at Random, MNAR, means there is a relationship between the propensity of a value to be missing and the value itself. This is a case where the people with the lowest education are missing on education or the sickest people are most likely to drop out of the study.

MNAR is called “non-ignorable” because the missing data mechanism itself has to be modeled as you deal with the missing data. You have to include some model for why the data are missing and what the likely values are.

“Missing Completely at Random” and “Missing at Random” are both considered ‘ignorable’ because we don’t have to include any information about the missing data itself when we deal with the missing data.

## Why you need to know the mechanism you have

Multiple imputation and Maximum Likelihood assume the data are at least missing at random. So the important distinction here is whether the data are MAR as opposed to MNAR.

Listwise deletion, however, requires that the data be MCAR in order not to introduce bias in the results.

As long as the distribution and percentage of missing data is not so great that it negatively affects power, listwise deletion can be a good choice for MCAR missing data. So the important distinction here is whether the data are MCAR as opposed to MAR.
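
To make the distinction concrete, here is a small simulation sketch in Python. Everything in it is made up for illustration: the population, the weights, and the missingness probabilities are assumptions, not real data. It computes how far the complete-case mean (what listwise deletion would give you) lands from the full-data mean under each mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Made-up population: sex is always observed, weight (kg) may go missing.
female = rng.integers(0, 2, size=n).astype(bool)
weight = rng.normal(np.where(female, 68.0, 80.0), 10.0)

# Assumed probability that weight is missing under each mechanism.
p_missing = {
    "MCAR": np.full(n, 0.3),                           # same for everyone
    "MAR": np.where(female, 0.5, 0.1),                 # depends on observed sex only
    "MNAR": 1 / (1 + np.exp(-(weight - 74.0) / 5.0)),  # depends on weight itself
}

# Bias of the complete-case mean relative to the full-data mean.
bias = {}
observed = {}
for name, p in p_missing.items():
    observed[name] = rng.random(n) >= p
    bias[name] = weight[observed[name]].mean() - weight.mean()
    print(f"{name}: bias of complete-case mean = {bias[name]:+.2f} kg")

# Under MAR, the bias goes away once we condition on the observed variable.
bias_within = {}
for label, group in (("men", ~female), ("women", female)):
    kept = group & observed["MAR"]
    bias_within[label] = weight[kept].mean() - weight[group].mean()
    print(f"MAR, {label} only: bias = {bias_within[label]:+.2f} kg")
```

In this setup the MCAR bias should sit close to zero, while the MAR and MNAR complete-case means are noticeably off. The MAR bias disappears within each sex, which is why methods that use the observed variables can recover from MAR but simple deletion cannot.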

Keep in mind that in most data sets, more than one variable will have missing data, and they may not all have the same mechanism. It’s worthwhile diagnosing the mechanism for each variable with missing data before choosing an approach.

I use the term diagnosing rather than testing, because you’re not going to get a straight answer without knowing the values of the missing data. Of course, if you knew those, you wouldn’t be doing any of this.

It’s like checking for multicollinearity or testing assumptions. Each piece of information tells you something, but there is no definitive answer.

You have to get at the mechanism in a number of ways and then decide if making the assumption about the mechanism is reasonable.

## Diagnosing the Mechanism

1. MAR vs. MNAR

The only true way to distinguish between MNAR and MAR is to measure some of that missing data. It’s a common practice among professional surveyors to, for example, follow up on a paper survey with phone calls to a group of the non-respondents and ask a few key survey items. This allows you to compare respondents to non-respondents.

If their responses on those key items differ by very much, that’s good evidence that the data are MNAR.
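
As a rough sketch of that comparison, the Python snippet below uses simulated scores on a single key item. The group sizes, means, and the variable names `respondents` and `followed_up` are all hypothetical. It reports a Welch t-test along with a standardized mean difference, since "differ by very much" is a question of size as well as significance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Made-up scores on one key item (say, a 0-100 health rating).
respondents = rng.normal(70, 10, size=400)  # answered the paper survey
followed_up = rng.normal(60, 10, size=60)   # non-respondents reached by phone

# Welch's t-test: do the group means differ?
t_stat, p_value = stats.ttest_ind(respondents, followed_up, equal_var=False)

# Standardized mean difference (Cohen's d with a pooled SD) to judge how much.
n1, n2 = len(respondents), len(followed_up)
pooled_var = ((n1 - 1) * respondents.var(ddof=1)
              + (n2 - 1) * followed_up.var(ddof=1)) / (n1 + n2 - 2)
cohens_d = (respondents.mean() - followed_up.mean()) / np.sqrt(pooled_var)

print(f"t = {t_stat:.2f}, p = {p_value:.3g}, d = {cohens_d:.2f}")
```

With real follow-up data you would repeat this for each key item. A large standardized difference on the items that matter most is the kind of evidence that points toward MNAR.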

However, in most missing data situations, we don’t have the luxury of getting hold of the missing data. So while we can’t test it directly, we can examine patterns in the data to get an idea of what’s the most likely mechanism.

The first thing in diagnosing randomness of the missing data is to use your substantive scientific knowledge of the data and your field. The more sensitive the issue, the less likely people are to tell you. They’re not going to tell you as much about their cocaine usage as they are about their phone usage.

Likewise, many fields have typical research situations in which non-ignorable missing data are common. Educate yourself in your field’s literature.

2. MCAR vs. MAR

There is a very useful test for MCAR, Little’s test. But like all tests of assumptions, it’s not definitive. So run it, but use it as only one piece of information.

A second technique is to create dummy variables for whether a variable is missing.

1 = missing
0 = observed

You can then run t-tests and chi-square tests between this variable and other variables in the data set to see if the missingness on this variable is related to the values of other variables.

For example, if women really are less likely to tell you their weight than men, a chi-square test will tell you that the percentage of missing data on the weight variable is higher for women than men.
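
The dummy-variable approach can be sketched in Python with pandas and SciPy. The data frame below is simulated, and its column names (`sex`, `age`, `weight`) and missingness rates are assumptions chosen to mimic the weight example, not anything from a real study.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
n = 2_000

# Hypothetical survey data; all names and values are made up.
df = pd.DataFrame({
    "sex": rng.choice(["female", "male"], size=n),
    "age": rng.normal(45, 12, size=n),
    "weight": rng.normal(75, 12, size=n),
})

# Make weight missing more often for women (MAR by construction).
p_missing = np.where(df["sex"] == "female", 0.35, 0.10)
df.loc[rng.random(n) < p_missing, "weight"] = np.nan

# Step 1: dummy variable, 1 = missing, 0 = observed.
df["weight_missing"] = df["weight"].isna().astype(int)

# Step 2a: chi-square test of missingness against a categorical variable.
table = pd.crosstab(df["sex"], df["weight_missing"])
chi2, p_sex, dof, expected = stats.chi2_contingency(table)

# Step 2b: t-test of a continuous variable between missing and observed cases.
age_missing = df.loc[df["weight_missing"] == 1, "age"]
age_observed = df.loc[df["weight_missing"] == 0, "age"]
t_stat, p_age = stats.ttest_ind(age_missing, age_observed, equal_var=False)

missing_rate = df.groupby("sex")["weight_missing"].mean()
print(missing_rate)
print(f"sex: chi2 = {chi2:.1f}, p = {p_sex:.3g}")
print(f"age: t = {t_stat:.2f}, p = {p_age:.3g}")
```

A small p-value for `sex` says missingness on weight is related to an observed variable, which rules out MCAR for that variable. It cannot tell you whether the data are MAR or MNAR, because that depends on the unobserved weights.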

The SPSS Missing Data module has a very nice procedure for doing this automatically, so you don’t have to create all those dummy variables. I don’t know of other software packages having this built in, but it’s not hard to program.


DJ

Dear Karen,

Thanks for these excellent pages on missing data and multiple imputation. This is all very new to me and I’m finding the statistical literature gets quite heavy, quite quickly, making it difficult to follow for any novice.

I’ve played around a little bit with MI packages in R (such as Amelia II) so I’m fairly comfortable with creating the imputed datasets. But my biggest problem at the moment is in understanding whether or not this approach is appropriate for my data.

I’m analysing regional records that I have acquired from a central government office. A small number of regions were not able (for unknown reasons) to return the requested data when the central office surveyed each region in 2012. As a consequence, data are missing for a small number of government regions. There does not appear to be any systematic reason for each region not returning this data. Regions with missing data include a mix of affluent urban regions, deprived urban regions, affluent rural regions and deprived rural regions. The trouble I’m having is that I’m not sure these observations are sufficient to conclude that missingness is MAR as opposed to MNAR. If not, what does sufficient evidence to make this conclusion look like?

This seems like an extremely important distinction for anyone considering MI techniques, yet a consideration that is not discussed in any great depth in the literature (at least not in a way I can understand). Most textbooks and online demonstrations warn against performing MI on missing data that is MNAR, but say little about how to judge this.

In your post above you have offered one of the best explanations I have seen yet, but I was wondering if you could elaborate on this at all?

Best wishes

DJ