One important consideration in choosing a missing data approach is the missing data mechanism—different approaches have different assumptions about the mechanism.
Each of the three mechanisms describes one possible relationship between the propensity of data to be missing and values of the data, both missing and observed.
The Missing Data Mechanisms
Missing Completely at Random, MCAR, means there is no relationship between the missingness of the data and any values, observed or missing. Those missing data points are a random subset of the data. There is nothing systematic going on that makes some data more likely to be missing than others.
Missing at Random, MAR, means there is a systematic relationship between the propensity of missing values and the observed data, but not the missing data.
Whether an observation is missing has nothing to do with the missing values, but it does have to do with the values of an individual’s observed variables. So, for example, if men are more likely to tell you their weight than women, weight is MAR.
Missing Not at Random, MNAR, means there is a relationship between the propensity of a value to be missing and the value itself. For example, the people with the lowest education are the ones most likely to have education missing, or the sickest people are the most likely to drop out of the study.
MNAR is called “non-ignorable” because the missing data mechanism itself has to be modeled as you deal with the missing data. You have to include some model for why the data are missing and what the likely values are.
“Missing Completely at Random” and “Missing at Random” are both considered “ignorable” because we don’t have to include any information about the missing data itself when we deal with the missing data.
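To make the three definitions concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the sample size, the weight distributions, the missingness probabilities, and the names `draw_missing` and `rate` are hypothetical and do not come from any real data set. It borrows the weight example from above and reports the share of missing weights in a few subgroups.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible
n = 40000

# Hypothetical data: sex is always observed, weight (kg) can go missing.
sex = [random.choice("FM") for _ in range(n)]
weight = [random.gauss(70 if s == "F" else 85, 10) for s in sex]


def draw_missing(probs):
    """Flag each row as missing (True) with its own probability."""
    return [random.random() < p for p in probs]


missing = {
    # MCAR: the same chance of being missing for everyone.
    "MCAR": draw_missing([0.30] * n),
    # MAR: the chance depends only on an observed variable (sex).
    "MAR": draw_missing([0.45 if s == "F" else 0.15 for s in sex]),
    # MNAR: the chance depends on the weight value that goes missing.
    "MNAR": draw_missing([0.50 if w > 80 else 0.10 for w in weight]),
}


def rate(flags, keep):
    """Share of missing values among the rows where keep is True."""
    chosen = [f for f, k in zip(flags, keep) if k]
    return sum(chosen) / len(chosen)


groups = {
    "women": [s == "F" for s in sex],
    "men": [s == "M" for s in sex],
    "lighter women": [s == "F" and w <= 70 for s, w in zip(sex, weight)],
    "heavier women": [s == "F" and w > 70 for s, w in zip(sex, weight)],
}

rates = {
    name: {label: rate(flags, keep) for label, keep in groups.items()}
    for name, flags in missing.items()
}
for name, by_group in rates.items():
    summary = ", ".join(f"{label} {value:.2f}" for label, value in by_group.items())
    print(f"{name}: {summary}")
```

Under MCAR all four rates should come out about equal. Under MAR the rates differ between women and men but should be about the same for lighter and heavier women, because once you account for sex the weight itself plays no role. Under MNAR the rate still differs by weight within women. In real data, of course, the last two columns cannot be computed, because the missing weights are unknown.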
Why you need to know the mechanism you have
Listwise deletion requires that the data be MCAR in order not to introduce bias in the results.
As long as the pattern and percentage of missing data are not so severe that they hurt power, listwise deletion can be a good choice for MCAR missing data. So the important distinction here is whether the data are MCAR as opposed to MAR.
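A small simulation can show why the MCAR versus MAR distinction matters for listwise deletion. This is only a sketch under made-up assumptions: the weight distributions, the missingness probabilities, and the helper `complete_case_mean` are hypothetical. It compares the mean weight of the full sample with the mean of the cases that survive deletion.

```python
import random
from statistics import mean

random.seed(2)  # fixed seed so the sketch is reproducible
n = 40000

# Hypothetical data: men are heavier on average than women.
sex = [random.choice("FM") for _ in range(n)]
weight = [random.gauss(70 if s == "F" else 85, 10) for s in sex]


def complete_case_mean(prob_missing):
    """Mean weight after dropping every row whose weight goes missing."""
    kept = [w for w, p in zip(weight, prob_missing) if random.random() >= p]
    return mean(kept)


full_mean = mean(weight)
# MCAR: everyone has the same 30% chance of a missing weight.
mcar_mean = complete_case_mean([0.30] * n)
# MAR: women are more likely than men to have a missing weight.
mar_mean = complete_case_mean([0.45 if s == "F" else 0.15 for s in sex])

print(f"full sample:        {full_mean:.1f}")
print(f"complete cases MCAR: {mcar_mean:.1f}")
print(f"complete cases MAR:  {mar_mean:.1f}")
```

With these settings the MCAR complete cases should reproduce the full-sample mean closely, while the MAR complete cases over-represent men and so overstate the average weight.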
Keep in mind that in most data sets, more than one variable will have missing data, and they may not all have the same mechanism. It’s worthwhile diagnosing the mechanism for each variable with missing data before choosing an approach.
I use the term diagnosing rather than testing, because you’re not going to get a straight answer without knowing the values of the missing data. Of course, if you knew those, you wouldn’t be doing any of this.
It’s like checking for multicollinearity or testing assumptions. Each piece of information tells you something, but there is no definitive answer.
You have to get at the mechanism in a number of ways and then decide if making the assumption about the mechanism is reasonable.
Diagnosing the Mechanism
1. MAR vs. MNAR
The only true way to distinguish between MNAR and MAR is to measure some of that missing data. It’s a common practice among professional surveyors to, for example, follow up on a paper survey with phone calls to a group of the non-respondents and ask a few key survey items. This allows you to compare respondents to non-respondents.
If their responses on those key items differ by very much, that’s good evidence that the data are MNAR.
However, in most missing data situations, we don’t have the luxury of getting hold of the missing data. So while we can’t test it directly, we can examine patterns in the data to get an idea of the most likely mechanism.
The first thing in diagnosing randomness of the missing data is to use your substantive scientific knowledge of the data and your field. The more sensitive the issue, the less likely people are to tell you. They’re not going to tell you as much about their cocaine usage as they are about their phone usage.
Likewise, many fields have typical research situations in which non-ignorable missing data are common. Educate yourself in your field’s literature.
2. MCAR vs. MAR
There is a very useful test for MCAR, Little’s test. But like all tests of assumptions, it’s not definitive. So run it, but use it as only one piece of information.
A second technique is to create dummy variables for whether a variable is missing.
1 = missing
0 = observed
You can then run t-tests and chi-square tests between this variable and other variables in the data set to see if the missingness on this variable is related to the values of other variables.
For example, if women really are less likely to tell you their weight than men, a chi-square test will tell you that the percentage of missing data on the weight variable is higher for women than men.
The SPSS Missing Data module has a very nice procedure for doing this automatically–you don’t have to create all those dummy variables. I don’t know of other software packages having this built in, but it’s not hard to program.
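As a rough illustration of how the dummy-variable check could be programmed, here is a sketch in plain Python. The data are simulated, and the variable names (`weight_missing`, `pct_missing`) and the missingness rates are assumptions made up for the example. It builds the missingness dummy for weight, cross-tabulates it against sex, and computes a Pearson chi-square test by hand.

```python
import math
import random

random.seed(3)  # fixed seed so the sketch is reproducible
n = 2000

# Hypothetical survey: women skip the weight question more often than men.
sex = [random.choice("FM") for _ in range(n)]
weight = [
    None if random.random() < (0.45 if s == "F" else 0.15) else random.gauss(75, 12)
    for s in sex
]

# Step 1: dummy variable, 1 = missing, 0 = observed.
weight_missing = [1 if w is None else 0 for w in weight]

# Step 2: cross-tabulate the dummy against sex.
counts = {(s, m): 0 for s in "FM" for m in (0, 1)}
for s, m in zip(sex, weight_missing):
    counts[(s, m)] += 1

row_total = {s: counts[(s, 0)] + counts[(s, 1)] for s in "FM"}
col_total = {m: counts[("F", m)] + counts[("M", m)] for m in (0, 1)}
pct_missing = {s: 100 * counts[(s, 1)] / row_total[s] for s in "FM"}

# Step 3: Pearson chi-square statistic for the 2x2 table
# (no continuity correction).
chi2 = 0.0
for s in "FM":
    for m in (0, 1):
        expected = row_total[s] * col_total[m] / n
        chi2 += (counts[(s, m)] - expected) ** 2 / expected

# With 1 degree of freedom the p-value has a closed form.
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"% missing, women: {pct_missing['F']:.1f}")
print(f"% missing, men:   {pct_missing['M']:.1f}")
print(f"chi-square = {chi2:.1f}, p = {p_value:.3g}")
```

A small p-value here says that missingness on weight is related to sex, which is evidence against MCAR for that variable. For a continuous variable, the same dummy would define the two groups of a t-test. If SciPy is available, `scipy.stats.chi2_contingency` and `scipy.stats.ttest_ind` run these tests directly; note that `chi2_contingency` applies a continuity correction to 2x2 tables by default, so its statistic will differ slightly from the hand calculation.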