Multiple Imputation for Propensity Score Analysis with Covariates Missing at Random

Introduction

Missing data is a common problem in statistical analysis, and can lead to biased or inefficient estimates if not handled properly. One method for dealing with missing data is multiple imputation, which involves creating multiple plausible values for the missing data and analyzing each imputed dataset separately, before combining the results. In this blog post, we will discuss the use of multiple imputation for propensity score analysis with covariates missing at random.

Propensity Score Analysis

Propensity score analysis is a method for estimating treatment effects in observational studies, where treatment is not randomly assigned. The propensity score is the probability of receiving the treatment given a set of observed covariates. Matching or weighting on the propensity score makes the treated and control groups more similar with respect to those covariates, which reduces confounding bias in the estimated treatment effect.
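To make the idea concrete, here is a minimal sketch in base R on simulated data. The variable names, the single confounder x, and the true effect of 2 are all assumptions chosen for illustration; inverse probability weighting is used as one possible way of weighting on the propensity score.

```r
# Simulated observational data (all names and values are illustrative)
set.seed(1)
n <- 2000
x <- rnorm(n)                                      # a single confounder
treatment <- rbinom(n, 1, plogis(0.8 * x))         # treatment depends on x
outcome <- 1 + 2 * treatment + 1.5 * x + rnorm(n)  # true treatment effect is 2

# Naive difference in group means, confounded by x
naive <- mean(outcome[treatment == 1]) - mean(outcome[treatment == 0])

# Propensity score: estimated probability of treatment given x
ps <- fitted(glm(treatment ~ x, family = binomial()))

# Inverse probability weights for the average treatment effect
w <- ifelse(treatment == 1, 1 / ps, 1 / (1 - ps))

# Weighted difference in group means
ipw <- weighted.mean(outcome[treatment == 1], w[treatment == 1]) -
  weighted.mean(outcome[treatment == 0], w[treatment == 0])

c(naive = naive, ipw = ipw)
```

In this setup the weighted estimate should land closer to the true effect than the naive comparison, because the weights balance x across the two groups.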

Missing Data

Missing covariate values are a particular problem in propensity score analysis, because the propensity score cannot be estimated for subjects with incomplete covariates. Missing data mechanisms fall into three categories: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). MCAR means that the probability of a value being missing is unrelated to both the observed and the unobserved data. MAR means that, given the observed data, the probability of a value being missing does not depend on the unobserved values. MNAR means that the probability of a value being missing depends on the unobserved values themselves, even after accounting for the observed data.
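The three mechanisms can be illustrated with a small simulation. This is only a sketch: the variables age and income, the sample size, and the missingness models are invented for the example.

```r
set.seed(2)
n <- 5000
age <- rnorm(n, 50, 10)                     # always observed
income <- 20 + 0.5 * age + rnorm(n, 0, 5)   # will be made partly missing

# MCAR: every value has the same 30% chance of being missing
miss_mcar <- rbinom(n, 1, 0.3) == 1

# MAR: missingness depends only on the observed variable age
miss_mar <- rbinom(n, 1, plogis((age - 50) / 5)) == 1

# MNAR: missingness depends on the unobserved income value itself
miss_mnar <- rbinom(n, 1, plogis((income - 45) / 3)) == 1

# Mean of the observed incomes under each mechanism vs. the full-data mean
observed_means <- c(
  full = mean(income),
  mcar = mean(income[!miss_mcar]),
  mar  = mean(income[!miss_mar]),
  mnar = mean(income[!miss_mnar])
)
observed_means
```

Under MCAR the complete-case mean stays close to the full-data mean, while under MAR and MNAR it is pulled downward here because higher values are more likely to be missing. The difference between the two is that under MAR the observed variable age carries enough information to correct for this, whereas under MNAR it does not.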

Multiple Imputation

Multiple imputation replaces each missing value with several plausible values, producing several completed datasets. Each dataset is analyzed separately and the results are then combined into a single estimate. Because the variation between the imputed datasets reflects the uncertainty about the missing values, this approach gives more honest standard errors than single imputation methods such as mean or hot-deck imputation, which treat imputed values as if they had been observed.
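The combining step is usually done with Rubin's rules (Rubin, 1987): the pooled estimate is the average of the per-dataset estimates, and the total variance adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/m). The helper below is a sketch written for this post; pool_rubin is a hypothetical name and the input numbers are made up.

```r
# Combine per-dataset results with Rubin's rules.
# estimates: point estimates from the m imputed datasets
# variances: their squared standard errors
pool_rubin <- function(estimates, variances) {
  m <- length(estimates)
  q_bar <- mean(estimates)            # pooled point estimate
  w_bar <- mean(variances)            # within-imputation variance
  b <- var(estimates)                 # between-imputation variance
  total <- w_bar + (1 + 1 / m) * b    # total variance
  c(estimate = q_bar, within = w_bar, between = b,
    total = total, se = sqrt(total))
}

# Made-up results from m = 3 imputed datasets
pooled <- pool_rubin(estimates = c(1, 2, 3), variances = c(0.5, 0.5, 0.5))
pooled
```

In practice the pool() function in the mice package performs this calculation for fitted models, so a hand-written version like this is mainly useful for seeing what is being computed.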

Multiple Imputation for Propensity Score Analysis

Multiple imputation can be combined with propensity score analysis to handle missing covariate values. First, several imputed datasets are created, each with plausible values filled in for the missing data. Next, the propensity scores are estimated separately within each imputed dataset, and the treatment effect is estimated within each dataset using those scores. Finally, the per-dataset treatment effect estimates are combined, typically with Rubin's rules, to obtain the final estimate.
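The steps above can be sketched as follows. Everything here is an assumption made for illustration: the simulated data, the covariate names x1 and x2, the use of inverse probability weighting for the treatment effect, and the helper ipw_effect. Only the point estimates are pooled; a full analysis would also need a variance estimate from each dataset.

```r
library(mice)

# Simulated data with a covariate that is missing at random
set.seed(3)
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
treatment <- rbinom(n, 1, plogis(0.5 * x1 + 0.5 * x2))
outcome <- 1 + 2 * treatment + x1 + x2 + rnorm(n)   # true effect is 2
x2[rbinom(n, 1, plogis(x1 - 1)) == 1] <- NA         # missingness depends on x1
dat <- data.frame(treatment, outcome, x1, x2)

# Step 1: create m = 5 imputed datasets
imp <- mice(dat, m = 5, seed = 3, printFlag = FALSE)

# Steps 2 and 3: propensity scores and treatment effect within one dataset
ipw_effect <- function(d) {
  ps <- fitted(glm(treatment ~ x1 + x2, data = d, family = binomial()))
  w <- ifelse(d$treatment == 1, 1 / ps, 1 / (1 - ps))
  weighted.mean(d$outcome[d$treatment == 1], w[d$treatment == 1]) -
    weighted.mean(d$outcome[d$treatment == 0], w[d$treatment == 0])
}
effects <- sapply(seq_len(imp$m), function(i) ipw_effect(complete(imp, i)))

# Step 4: pooled point estimate is the mean across imputed datasets
pooled_effect <- mean(effects)
pooled_effect
```

Note that the treatment and outcome columns are left in the data passed to mice(), so they inform the imputation of x2.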

It is important to note that multiple imputation assumes that the data is missing at random (MAR) or missing completely at random (MCAR). If the data is missing not at random (MNAR), multiple imputation may not provide accurate results.

Implementation of Multiple Imputation for Propensity Score Analysis

Multiple imputation can be implemented in R using the mice package. The following code shows an example of how to use the mice package to impute missing data and estimate the propensity scores.

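# Install the package (only needed once)
install.packages("mice")

# Load the package
library(mice)

# `data` is assumed to be a data frame holding a binary `treatment` column and
# the covariates; `x1` and `x2` below are placeholder covariate names

# Specify the imputation method; "logreg" is only valid for binary variables,
# so this assumes every incomplete column is binary
imputation_method <- "logreg"

# Impute the missing data (m = 5 imputed datasets; seed for reproducibility)
imputed_data <- mice(data, method = imputation_method, m = 5, seed = 123)

# Estimate the propensity score model within each imputed dataset;
# with() is needed because glm() cannot take the mids object as `data`
propensity_models <- with(imputed_data, glm(treatment ~ x1 + x2, family = binomial()))

# Extract the fitted propensity scores, one vector per imputed dataset
propensity_scores <- lapply(propensity_models$analyses, fitted)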

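In this example, we used the logreg method, which is appropriate for binary variables. Other methods are available for other variable types, such as pmm (predictive mean matching) for numeric variables and polyreg for unordered categorical variables with more than two levels. The glm function with family = binomial() fits a logistic regression (logit link) for the propensity score; a probit model can be fit instead with family = binomial(link = "probit").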
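When the incomplete columns are of different types, a separate method can be set for each column. The sketch below uses make.method() from mice to obtain the default method for each column; the data frame and its column names are invented for the example.

```r
library(mice)

# Illustrative data with a numeric, a binary, and a three-level variable
set.seed(4)
dat <- data.frame(
  age = rnorm(200, 50, 10),
  smoker = factor(rbinom(200, 1, 0.3), labels = c("no", "yes")),
  region = factor(sample(c("north", "south", "west"), 200, replace = TRUE))
)
dat$age[sample(200, 20)] <- NA
dat$smoker[sample(200, 20)] <- NA
dat$region[sample(200, 20)] <- NA

# One method per column, chosen by mice from the column type
meth <- make.method(dat)
meth

# Individual entries can be overridden before imputing, e.g. meth["age"] <- "norm"
imp <- mice(dat, method = meth, m = 5, seed = 4, printFlag = FALSE)
```

With the package defaults, the numeric column is assigned pmm, the two-level factor logreg, and the three-level factor polyreg.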

Conclusion

Missing data is a common problem in statistical analysis, and multiple imputation is a useful way of handling it. By creating several plausible values for each missing value and analyzing each imputed dataset separately, multiple imputation accounts for the uncertainty about the missing data in a way that single imputation methods do not. Used together with propensity score analysis, it can help reduce bias in the estimated treatment effect in observational studies with missing covariate values. Its validity rests on the data being MAR or MCAR; under MNAR the results may not be accurate.

References
  • Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.). John Wiley & Sons.
  • Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. John Wiley & Sons.
  • van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1-67.