Statistical disclosure control for public microdata: present and future

Korean title (translated): A review of statistical disclosure control methodology for the release of microdata

  • Park, Min-Jeong (Statistical Research Institute, Statistics Korea)
  • Kim, Hang J. (Department of Mathematical Sciences, University of Cincinnati)
  • Received : 2016.08.31
  • Accepted : 2016.10.09
  • Published : 2016.10.31

Abstract

The increasing demand from researchers and policy makers for microdata has also heightened related privacy and security concerns. During the past two decades, a large volume of literature on statistical disclosure control (SDC) has been published in international journals. This review paper introduces relatively recent SDC approaches to the communities of Korean statisticians and statistical agencies. In addition to traditional masking techniques (such as microaggregation and noise addition), we introduce online analytic systems, differential privacy, and synthetic data. For each approach, we present the methodology and an application example and discuss its pros and cons, so that the paper can assist statistical agencies seeking a practical SDC approach.

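Korean abstract (translated): Wider use of in-depth data for academic research and policy making also raises concern about the disclosure of individual information. For this reason, many papers on statistical disclosure control (privacy protection) have been published over the past twenty or so years. This paper summarizes that research and introduces it to statisticians and agencies in Korea. It covers not only traditional masking techniques such as microaggregation and noise addition, but also privacy protection in online data analysis systems, disclosure control through differential privacy, and synthetic data as an alternative means of protection. For each topic, we introduce the methodology and discuss application examples along with advantages and disadvantages. We hope this paper will help statisticians who face practical statistical disclosure control problems.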
