There are many ways to approach missing data. The most common, I believe, is to ignore it. But making no choice means that your statistical software is choosing for you. And your software is generally choosing listwise deletion, which may or may not be a bad choice, depending on why and how much data are missing.
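To get a feel for how quickly listwise deletion eats a data set, here is a minimal sketch in plain Python. The records, the variable names, and the `listwise_delete` helper are all made up for illustration; real statistical packages do the equivalent internally.

```python
# Hypothetical toy data: None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "score": 7.1},
    {"age": 29, "income": None, "score": 6.4},
    {"age": 41, "income": 61000, "score": None},
    {"age": 38, "income": 58000, "score": 8.0},
    {"age": None, "income": 47000, "score": 5.9},
]

def listwise_delete(records):
    """Keep only the records with no missing value on any variable."""
    return [r for r in records if all(v is not None for v in r.values())]

complete = listwise_delete(rows)
print(f"{len(complete)} of {len(rows)} rows survive listwise deletion")
```

In this toy example each variable is missing for only one of five cases, yet three of the five rows are dropped, because a single missing value is enough to remove the whole case.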
Another common approach, among those who are paying attention, is imputation: replacing the missing values with an estimate, then analyzing the full data set as if the imputed values were actual observed values. There are many ways to choose an estimate. The following are common methods:
* Mean: the mean of the observed values for that variable
* Substitution: the value from a new individual who was not selected to be in the sample
* Hot deck: a randomly chosen value from an individual who has similar values on other variables
* Cold deck: a systematically chosen value from an individual who has similar values on other variables
* Regression: the predicted value obtained by regressing the missing variable on other variables
* Stochastic regression: the predicted value from a regression plus a random residual value
* Interpolation and extrapolation: an estimated value from other observations from the same individual
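Three of the methods above (mean, regression, and stochastic regression) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data are simulated, there is a single predictor `x`, and the `impute` function and its method names are hypothetical, not any package's API.

```python
import random
import statistics

random.seed(0)

# Hypothetical data: y depends linearly on x; about 30% of y is set to missing.
x = [random.gauss(50, 10) for _ in range(200)]
y_true = [2.0 + 0.5 * xi + random.gauss(0, 3) for xi in x]
y = [None if random.random() < 0.3 else yi for yi in y_true]

# Observed (x, y) pairs only.
obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
ox = [p[0] for p in obs]
oy = [p[1] for p in obs]

# Least-squares fit of y on x using the observed cases.
mx, my = statistics.fmean(ox), statistics.fmean(oy)
slope = sum((a - mx) * (b - my) for a, b in obs) / sum((a - mx) ** 2 for a in ox)
intercept = my - slope * mx
# Residual standard deviation, used for the random draw in stochastic regression.
resid_sd = (sum((b - (intercept + slope * a)) ** 2 for a, b in obs) / (len(obs) - 2)) ** 0.5

def impute(method):
    """Return a copy of y with every missing value filled in by `method`."""
    out = []
    for xi, yi in zip(x, y):
        if yi is not None:
            out.append(yi)  # observed values are left alone
        elif method == "mean":
            out.append(my)  # mean of the observed y values
        elif method == "regression":
            out.append(intercept + slope * xi)  # predicted value
        elif method == "stochastic":
            # predicted value plus a random residual
            out.append(intercept + slope * xi + random.gauss(0, resid_sd))
        else:
            raise ValueError(f"unknown method: {method}")
    return out

for name in ("mean", "regression", "stochastic"):
    filled = impute(name)
    print(name, "-> n =", len(filled), " sd =", round(statistics.stdev(filled), 2))
```

Whatever the method, the filled-in variable has the full 200 values, which is exactly what makes imputation attractive. The printed standard deviations give a first hint that the methods do not preserve the spread of the data equally well.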
Imputation is popular because it is conceptually simple and because the resulting sample has the same number of observations as the full data set. It can be very tempting when listwise deletion eliminates a large proportion of the data set. But it has limitations. Some imputation methods result in biased parameter estimates, such as means and correlations, unless the data are missing completely at random (MCAR), and mean imputation shrinks variances and correlations even then. The bias is often worse than with complete-case analysis, especially for mean imputation. The extent of the bias depends on many factors, including the missing data mechanism, the proportion of the data that is missing, and the information available in the data set.
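A small simulation makes the problem with mean imputation concrete. The sketch below is an assumption-laden illustration, not a general proof: the data are simulated, the 40% missingness is completely at random by construction, and the `corr` helper is written out by hand. It compares complete-case analysis with mean imputation on the variance of `y` and on its correlation with `x`.

```python
import random
import statistics

random.seed(1)
n = 2000

# Hypothetical data: x and y are positively correlated.
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.8 * xi + random.gauss(0, 0.6) for xi in x]

# MCAR by construction: every y value has the same 40% chance of being missing.
missing = [random.random() < 0.4 for _ in range(n)]

def corr(a, b):
    """Pearson correlation, written out to avoid version-specific helpers."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / (saa * sbb) ** 0.5

# Complete-case analysis: drop the pairs where y is missing.
x_cc = [xi for xi, m in zip(x, missing) if not m]
y_cc = [yi for yi, m in zip(y, missing) if not m]

# Mean imputation: fill every missing y with the observed mean.
y_bar = statistics.fmean(y_cc)
y_imp = [y_bar if m else yi for yi, m in zip(y, missing)]

print("variance  complete-case:", round(statistics.variance(y_cc), 3))
print("variance  mean-imputed :", round(statistics.variance(y_imp), 3))
print("corr      complete-case:", round(corr(x_cc, y_cc), 3))
print("corr      mean-imputed :", round(corr(x, y_imp), 3))
```

The mean of `y` is the same either way, but every imputed value sits exactly at that mean, so it adds nothing to the sum of squares while still counting toward the sample size. The variance of the mean-imputed variable is therefore the complete-case variance multiplied by (n_obs - 1) / (n - 1), and the correlation with `x` is pulled toward zero along with it.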
Moreover, all of these imputation methods underestimate standard errors. Since the imputed observations are themselves estimates, their values have corresponding random error. But your software doesn’t know that, so it overlooks the extra source of error, resulting in too-small standard errors and too-small p-values. And although imputation is conceptually simple, it is difficult to do well in practice. So it’s not ideal, but might suffice in certain situations.
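The too-small standard errors are easy to see with mean imputation, the simplest case. The following sketch uses simulated data and a hypothetical `se_of_mean` helper; it computes the textbook standard error of the mean twice, once from the observed values only and once from the imputed data set treated as if every value were real.

```python
import random
import statistics

random.seed(2)
n = 500

# Hypothetical variable with about 40% of values missing completely at random.
values = [random.gauss(100, 15) for _ in range(n)]
observed = [v for v in values if random.random() >= 0.4]

# Mean imputation restores the sample size without adding any information.
fill = statistics.fmean(observed)
imputed = observed + [fill] * (n - len(observed))

def se_of_mean(sample):
    """Textbook standard error of the mean: s / sqrt(n)."""
    return statistics.stdev(sample) / len(sample) ** 0.5

se_complete = se_of_mean(observed)
se_naive = se_of_mean(imputed)

print("n observed:", len(observed), " n after imputation:", len(imputed))
print("SE from complete cases:", round(se_complete, 3))
print("SE treating imputed values as real:", round(se_naive, 3))
```

Both analyses produce the same estimate of the mean, since the imputed values are just copies of it, so the real uncertainty cannot have changed. The naive standard error is smaller anyway, for two reasons: the standard deviation has shrunk and the software divides by the square root of the larger, partly fictional sample size.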