Privacy Disclosure and Preservation in Learning with Multi-Relational Databases

  • Guo, Hongyu (Institute for Information Technology, National Research Council of Canada)
  • Viktor, Herna L. (School of Electrical Engineering and Computer Science, University of Ottawa)
  • Paquet, Eric (Institute for Information Technology, National Research Council of Canada; School of Electrical Engineering and Computer Science, University of Ottawa)
  • Received: 2011.02.01
  • Accepted: 2011.03.20
  • Published: 2011.09.30

Abstract

There has recently been a surge of interest in relational database mining, which aims to discover useful patterns across multiple interlinked database relations. It is crucial for a learning algorithm to explore the multiple interconnected relations so that important attributes are not excluded when mining such relational repositories. From a data privacy perspective, however, a complex database schema makes it difficult to identify all possible relationships between attributes in the different relations. That is, seemingly harmless attributes may be linked to confidential information, leading to data leaks when building a model. Thus, we are at risk of disclosing unwanted knowledge when publishing the results of a data mining exercise. For instance, consider a financial database classification task to determine whether a loan is considered high risk. Suppose that we are aware that the database contains another confidential attribute, such as income level, that should not be divulged. One may thus choose to eliminate, or distort, the income level in the database to prevent potential privacy leakage. However, even after distortion, a learning model built against the modified database may still accurately determine the income level values. It follows that the database is still unsafe and may be compromised. This paper demonstrates this potential for privacy leakage in multi-relational classification and illustrates how such potential leaks may be detected. We propose a method to generate a ranked list of subschemas that maintains the predictive performance on the class attribute, while limiting the disclosure risk, and predictive accuracy, of confidential attributes. We demonstrate the effectiveness of our method on a financial database and an insurance database.
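The leakage scenario in the abstract can be illustrated with a small sketch. This is not the paper's method: the records, the `district` attribute, and both helper functions are hypothetical, and accuracy is measured on the same toy rows used to derive the rule. It only shows that, after the confidential column is withheld, a seemingly harmless attribute may still predict it better than a blind guess.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (district, income_level).
# income_level plays the role of the confidential attribute;
# district is the seemingly harmless attribute that stays published.
records = [
    ("north", "high"), ("north", "high"), ("north", "high"), ("north", "low"),
    ("south", "low"), ("south", "low"), ("south", "low"), ("south", "high"),
]

def baseline_accuracy(rows):
    """Accuracy of always guessing the most common confidential value."""
    counts = Counter(confidential for _, confidential in rows)
    return counts.most_common(1)[0][1] / len(rows)

def leakage_accuracy(rows):
    """Accuracy of guessing the confidential value from the harmless one,
    using a majority-vote rule per harmless value (same rows, no hold-out)."""
    by_value = defaultdict(Counter)
    for harmless, confidential in rows:
        by_value[harmless][confidential] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_value.values())
    return correct / len(rows)

print("baseline:", baseline_accuracy(records))
print("via harmless attribute:", leakage_accuracy(records))
```

A gap between the two numbers suggests that the published attribute leaks information about the withheld one. A subschema-ranking approach of the kind the abstract describes would presumably prefer subschemas where this gap is small while accuracy on the class attribute stays high.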
