Missing Data

Andy Grogan-Kaylor

Dealing with missing data is a complex issue. This is an evolving blog post on approaches to this issue.

In the simulated data below, there are several missing values in the first few rows of data.

Table 1: Simulated Data

x	z	y
25.5	1	NA
10.3	NA	10.21
10.43	1	8.274
5.596	0	6.159
11.53	0	8.09
18.26	1	22.38
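As a minimal sketch, the six rows shown in Table 1 can be entered by hand and checked for missing values. This assumes Python with pandas and NumPy, which the original post does not specify. The names `df`, `n_missing`, and `complete` are illustrative only.

```python
import numpy as np
import pandas as pd

# The six rows shown in Table 1; np.nan stands in for NA.
df = pd.DataFrame({
    "x": [25.5, 10.3, 10.43, 5.596, 11.53, 18.26],
    "z": [1, np.nan, 1, 0, 0, 1],
    "y": [np.nan, 10.21, 8.274, 6.159, 8.09, 22.38],
})

# Number of missing values in each column.
n_missing = df.isna().sum()

# Rows in which no variable is missing.
complete = df.dropna()

print(n_missing)
print(len(complete), "of", len(df), "rows have no missing values")
```

Only rows without any missing value survive `dropna()`, so a single missing entry removes the whole row from the analysis.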

Here are some possible approaches that we could employ with missing data.

We could do nothing. Doing nothing would be easy. However, our sample size would be reduced, because most analyses use only complete cases (rows with no missing values on any variable). Also, missing data usually occur for a reason. Often respondents with missing data have the lowest incomes, are the most marginalized or discriminated against, and have histories with the most trauma or violence. Therefore relying on complete case analyses might introduce some amount of bias.
We could replace missing values with the mean. Replacing missing values with the mean would be easy. However, because respondents usually have missing data for a reason, replacing missing data with the mean would introduce bias into the data. It would also introduce false certainty into the data: the imputed values add no variability, so p values would be artificially lowered.
We could use regression to impute the missing values from the complete cases. For example, \(\hat{y} = \beta_0 + \beta_1 x + \beta_2 z\). This approach would be more difficult, but would make use of the information in the covariates about the reasons for missing data, and would therefore reduce bias. However, by imputing a single value for each missing value, we would still be introducing false certainty into the data (p values would be artificially lowered).
We could perform multiple imputation. Multiple imputation would involve using regression to impute the missing values, but we would impute the missing values multiple times, each time introducing a small amount of random variation. This procedure would be more difficult, though it is readily available in most statistical software, and would make use of the information in the covariates about the reasons for missing data, thus reducing bias. Further, by imputing the data multiple times, we would preserve the uncertainty around the estimates, thus producing better estimates of p values.
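The four approaches above can be compared in a rough sketch. Everything here is an assumption made for illustration and not taken from the original post: the use of Python with NumPy, the simulated data, the missingness mechanism, and the number of imputations `m`. The multiple imputation step is deliberately simplified. It adds random noise to the regression predictions but does not redraw the regression coefficients for each imputation, as a full implementation in statistical software would. The pooling step uses Rubin's rules for combining estimates across imputed data sets.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical simulated data: y depends on x and z.
n = 500
x = rng.normal(10, 5, n)
z = rng.integers(0, 2, n).astype(float)
y = 1.0 + 0.8 * x + 2.0 * z + rng.normal(0, 2, n)

# Assumed missingness mechanism: y is missing more often when x is low,
# so the data are missing "for a reason" that x can partly explain.
p_missing = 1 / (1 + np.exp((x - 8) / 2))
y_obs = np.where(rng.random(n) < p_missing, np.nan, y)
missing = np.isnan(y_obs)
observed = ~missing

# 1. Do nothing: use only the cases where y is observed.
complete_mean = y_obs[observed].mean()

# 2. Mean imputation: fill every missing y with the observed mean.
y_mean_imp = np.where(missing, complete_mean, y_obs)

# 3. Regression imputation: fit y ~ x + z on the complete cases,
#    then fill each missing y with its predicted value.
X = np.column_stack([np.ones(n), x, z])
beta, *_ = np.linalg.lstsq(X[observed], y_obs[observed], rcond=None)
y_reg_imp = np.where(missing, X @ beta, y_obs)

# 4. Multiple imputation (simplified): repeat the regression imputation
#    m times, adding random noise with the residual standard deviation.
resid = y_obs[observed] - X[observed] @ beta
sigma = resid.std(ddof=X.shape[1])
m = 20
estimates, variances = [], []
for _ in range(m):
    noise = rng.normal(0, sigma, n)
    y_imp = np.where(missing, X @ beta + noise, y_obs)
    estimates.append(y_imp.mean())            # estimate of the mean of y
    variances.append(y_imp.var(ddof=1) / n)   # its squared standard error

# Rubin's rules: pool the estimates and combine within- and
# between-imputation variance.
pooled = np.mean(estimates)
within = np.mean(variances)
between = np.var(estimates, ddof=1)
total_var = within + (1 + 1 / m) * between

print("true mean of y:       ", y.mean())
print("complete cases:       ", complete_mean)
print("mean imputation:      ", y_mean_imp.mean())
print("regression imputation:", y_reg_imp.mean())
print("multiple imputation:  ", pooled, "+/-", np.sqrt(total_var))
```

In this sketch, mean imputation leaves the estimate of the mean unchanged from the complete case analysis while shrinking the spread of `y`. The between-imputation term in `total_var` is what carries the extra uncertainty that a single imputation would ignore.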