http://dx.doi.org/10.5351/KJAS.2019.32.1.083

Multiple imputation and synthetic data  

Kim, Joungyoun (Department of Information & Statistics, Chungbuk National University)
Park, Min-Jeong (Statistical Research Institute, Statistics Korea)
Publication Information
The Korean Journal of Applied Statistics, v.32, no.1, 2019, pp. 83-97
Abstract
As society develops, the dissemination of microdata has increased in response to the diverse analytical needs of users. Analysis of microdata for policy making, academic research, and similar purposes is highly desirable in terms of value creation. However, releasing microdata that remain useful carries a risk of exposing personal information. Several methods have been considered to protect personal information while preserving the usefulness of the data; one of them is the generation and use of synthetic data. This paper aims to clarify synthetic data by exploring the related methodologies and precautions. To this end, we first explain multiple imputation, the Bayesian predictive model, and the Bayesian bootstrap, which form the foundations of synthetic data. We then link these concepts to the construction of fully and partially synthetic data. To illustrate how synthetic data are created, we review a real example of longitudinal synthetic data based on sequential regression multivariate imputation.
Keywords
synthetic data; multiple imputation; Bayesian prediction model; Bayesian bootstrap; microdata
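Illustration
The abstract names the Bayesian bootstrap (Rubin, 1981) as one of the foundations of synthetic data. The following is a minimal sketch of that idea only, not the procedure used in the paper: each replicate draws a Dirichlet(1, ..., 1) weight vector over the observed records and resamples under those weights. The function names, the data values, and the number of replicates m are hypothetical and chosen for illustration.

```python
import random


def bayesian_bootstrap_weights(n, rng):
    """Draw one Dirichlet(1, ..., 1) weight vector of length n."""
    # Independent Exponential(1) draws, normalised to sum to one,
    # follow a Dirichlet(1, ..., 1) distribution.
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def synthesize(observed, m, rng):
    """Return m resampled copies of `observed`, each under fresh weights."""
    copies = []
    for _ in range(m):
        weights = bayesian_bootstrap_weights(len(observed), rng)
        # Sample with replacement, using the drawn weights as probabilities.
        copies.append(rng.choices(observed, weights=weights, k=len(observed)))
    return copies


rng = random.Random(0)                      # fixed seed for reproducibility
observed = [23.1, 19.4, 30.2, 27.8, 21.5]   # made-up values, not from the paper
synthetic_sets = synthesize(observed, m=3, rng=rng)
```

Each copy contains only values that occur in the observed data, so a sketch like this conveys the resampling mechanism but does not by itself limit disclosure. In the synthetic-data literature the released values are usually drawn from a fitted model instead.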