A comparison of synthetic data approaches using utility and disclosure risk measures

Seongbin An; Trang Doan; Juhee Lee; Jiwoo Kim; Yong Jae Kim; Yunji Kim; Changwon Yoon; Sungkyu Jung; Dongha Kim; Sunghoon Kwon; Hang J Kim; Jeongyoun Ahn; Cheolwoo Park

doi:10.5351/KJAS.2023.36.2.141

The Korean Journal of Applied Statistics (응용통계연구)

Volume 36, Issue 2 / Pages 141-166 / 2023 / 1225-066X (pISSN) / 2383-5818 (eISSN)

The Korean Statistical Society (한국통계학회)

Seongbin An (Department of Industrial & Systems Engineering, KAIST) ;
Trang Doan (Department of Applied Statistics, Konkuk University) ;
Juhee Lee (Department of Statistics, Kyungpook National University) ;
Jiwoo Kim (Department of Statistics, Sungshin Women's University) ;
Yong Jae Kim (Department of Statistics, Seoul National University) ;
Yunji Kim (Department of Industrial & Systems Engineering, KAIST) ;
Changwon Yoon (Department of Industrial & Systems Engineering, KAIST) ;
Sungkyu Jung (Department of Statistics, Seoul National University) ;
Dongha Kim (Department of Statistics, Sungshin Women's University) ;
Sunghoon Kwon (Department of Applied Statistics, Konkuk University) ;
Hang J Kim (Division of Statistics and Data Science, University of Cincinnati) ;
Jeongyoun Ahn (Department of Industrial & Systems Engineering, KAIST) ;
Cheolwoo Park (Department of Mathematical Sciences, KAIST)

Received : 2022.11.24
Accepted : 2023.01.12
Published : 2023.04.30

https://doi.org/10.5351/KJAS.2023.36.2.141

Abstract

This paper investigates synthetic data generation methods and the measures used to evaluate them. Demand for releasing various types of data to the public for different purposes has been increasing. At the same time, there are unavoidable concerns about leaking critical or sensitive information. To address these concerns, many synthetic data generation methods have been proposed over the years, and some have been implemented in several countries, including Korea. The current study introduces and compares three representative synthetic data generation approaches: sequential regression, nonparametric Bayesian multiple imputation, and deep generative models. Several evaluation metrics that measure the utility and disclosure risk of synthetic data are also reviewed. We provide empirical comparisons of the three approaches with respect to these evaluation measures. The findings should help practitioners better understand the advantages and disadvantages of these synthetic data methods.

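Generating and releasing synthetic data is a representative way to prevent the risk of information leakage that accompanies data disclosure. As the use of data has become increasingly important in industry, research on synthetic data is being actively conducted in many countries and institutions, including Korea. This paper introduces representative synthetic data generation methods and evaluation measures. By describing how synthetic data are generated with multiple imputation, a traditional approach, and with recently proposed neural-network-based methods, it supports an overall understanding of synthetic data generation. In addition, by analyzing and comparing the generated synthetic data using a variety of evaluation measures, it aims to suggest directions for future research and to lay a foundation for that work.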

Acknowledgement

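This work was supported by an Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (Ministry of Science and ICT) in 2022 (No. 2022-0-00937, Solving the problem of increasing the usability and usefulness of synthetic data techniques for statistical data).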
