http://dx.doi.org/10.5351/KJAS.2016.29.6.1041

Statistical disclosure control for public microdata: present and future  

Park, Min-Jeong (Statistical Research Institute, Statistics Korea)
Kim, Hang J. (Department of Mathematical Sciences, University of Cincinnati)
Publication Information
The Korean Journal of Applied Statistics / v.29, no.6, 2016, pp. 1041-1059
Abstract
The increasing demand from researchers and policy makers for microdata has heightened the related privacy and security concerns. During the past two decades, a large volume of literature on statistical disclosure control (SDC) has been published in international journals. This review paper introduces relatively recent SDC approaches to the communities of Korean statisticians and statistical agencies. In addition to the traditional masking techniques (such as microaggregation and noise addition), we introduce an online analytic system, differential privacy, and synthetic data. For each approach, we highlight an application example along with its methodology, advantages, and drawbacks, so that the paper can assist statistical agencies seeking a practical SDC approach.
Keywords
data privacy; masking; analytic system; differential privacy; synthetic data