Privacy Disclosure and Preservation in Learning with Multi-Relational Databases

Guo, Hongyu; Viktor, Herna L.; Paquet, Eric

doi:10.5626/JCSE.2011.5.3.183

Journal of Computing Science and Engineering

Volume 5 Issue 3 / Pages 183-196 / 2011 / 1976-4677 (pISSN) / 2093-8020 (eISSN)

Korean Institute of Information Scientists and Engineers (한국정보과학회)

Guo, Hongyu (Institute for Information Technology, National Research Council of Canada)
Viktor, Herna L. (School of Electrical Engineering and Computer Science, University of Ottawa)
Paquet, Eric (Institute for Information Technology, National Research Council of Canada; School of Electrical Engineering and Computer Science, University of Ottawa)

Received : 2011.02.01
Accepted : 2011.03.20
Published : 2011.09.30

https://doi.org/10.5626/JCSE.2011.5.3.183

Abstract

There has recently been a surge of interest in relational database mining, which aims to discover useful patterns across multiple interlinked database relations. When mining such relational repositories, a learning algorithm must explore the multiple interconnected relations so that important attributes are not excluded. From a data privacy perspective, however, a complex database schema makes it difficult to identify all possible relationships between attributes in different relations. Seemingly harmless attributes may be linked to confidential information, leading to data leaks when a model is built, so publishing the results of a data mining exercise risks disclosing unwanted knowledge. For instance, consider a classification task on a financial database that determines whether a loan is high risk. Suppose the database also contains a confidential attribute, such as income level, that should not be divulged. One may choose to eliminate or distort the income level in the database to prevent potential privacy leakage. However, even after such distortion, a model learned from the modified database may still accurately predict the income level values. The database therefore remains unsafe and may be compromised. This paper demonstrates this potential for privacy leakage in multi-relational classification and illustrates how such leaks may be detected. We propose a method that generates a ranked list of subschemas, each of which maintains predictive performance on the class attribute while limiting both the disclosure risk and the predictive accuracy of confidential attributes. We demonstrate the effectiveness of our method on a financial database and an insurance database.
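
The idea of ranking subschemas can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the toy rows, the relation and attribute names, the lookup-table classifier, and the resubstitution accuracy used as a score are all assumptions made for illustration. The sketch enumerates subsets of relations, keeps those whose accuracy on the class attribute (`risk`) meets a threshold, and orders them by how well the same attributes predict the confidential attribute (`income`).

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy data: one row per loan after joining the relations.
# Field and relation names are invented for this sketch.
FIELDS = ("loan.amount", "account.frequency", "district.region", "risk", "income")
RAW = [
    ("high", "weekly", "north", "high", "low"),
    ("high", "weekly", "north", "high", "low"),
    ("high", "monthly", "south", "high", "high"),
    ("high", "monthly", "north", "high", "low"),
    ("low", "monthly", "south", "low", "high"),
    ("low", "monthly", "south", "low", "high"),
    ("low", "weekly", "north", "low", "low"),
    ("low", "monthly", "south", "low", "high"),
]
ROWS = [dict(zip(FIELDS, values)) for values in RAW]
RELATIONS = {
    "loan": ["loan.amount"],
    "account": ["account.frequency"],
    "district": ["district.region"],
}


def lookup_accuracy(rows, attributes, label):
    """Resubstitution accuracy of a majority-vote lookup table.

    Rows are grouped by their values on `attributes`; each group predicts
    its most frequent `label` value. With no attributes this reduces to the
    majority-class baseline. It stands in for a real classifier here.
    """
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[a] for a in attributes)
        groups[key][row[label]] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in groups.values())
    return correct / len(rows)


def rank_subschemas(rows, relations, target, confidential, min_target_acc):
    """Return (subschema, target_acc, confidential_acc) tuples.

    Only subschemas reaching `min_target_acc` on the target are kept. They
    are sorted by ascending confidential accuracy (least leakage first),
    then by descending target accuracy, then by name for determinism.
    """
    ranked = []
    names = sorted(relations)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            attrs = [a for rel in subset for a in relations[rel]]
            t_acc = lookup_accuracy(rows, attrs, target)
            c_acc = lookup_accuracy(rows, attrs, confidential)
            if t_acc >= min_target_acc:
                ranked.append((subset, t_acc, c_acc))
    ranked.sort(key=lambda item: (item[2], -item[1], item[0]))
    return ranked


baseline = lookup_accuracy(ROWS, [], "income")
ranked = rank_subschemas(ROWS, RELATIONS, "risk", "income", min_target_acc=0.9)
print("income baseline:", baseline)
for subschema, t_acc, c_acc in ranked:
    print(subschema, "risk acc:", t_acc, "income acc:", c_acc)
```

On this toy data the `income` column is never given to the classifier, yet the full schema recovers it perfectly, against a majority-class baseline of 0.5. The subschema containing only the `loan` relation ranks first: it keeps the accuracy on `risk` at 1.0 while holding the accuracy on `income` to 0.75. A realistic evaluation would use held-out data and a proper multi-relational learner in place of the lookup table.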
