Secure Multiparty Computation of Principal Component Analysis

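  • Sang-Pil Kim (Department of Computer Science, Kangwon National University) ;
  • Sanghun Lee (Department of Computer Science, Kangwon National University) ;
  • Myeong-Seon Gil (Department of Computer Science, Kangwon National University) ;
  • Yang-Sae Moon (Department of Computer Science, Kangwon National University) ;
  • Hee-Sun Won (Big Data SW Platform Research Department, Electronics and Telecommunications Research Institute)
  • Received : 2015.01.29
  • Accepted : 2015.05.01
  • Published : 2015.07.15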

Abstract

In recent years, privacy-preserving data mining (PPDM) on large volumes of data has been actively studied. In this paper, we address PPDM based on principal component analysis (PCA), which is widely used to identify correlations among sensitive data sets. The conventional way to compute PCA is to collect all the data spread across multiple nodes into a single node before starting the computation; however, this approach discloses the sensitive data of individual nodes to one another, requires a large amount of computation, and incurs large communication overhead while the data are being collected. To solve these problems, we present an efficient method that securely computes PCA without collecting the data in one place. The proposed method shares only limited information among the nodes, yet obtains the same result as the original PCA. In addition, we apply dimensionality reduction to the secure PCA and use it for secure similar document detection. Finally, through various experiments, we show that the proposed method works efficiently on large volumes of multi-dimensional data.
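The abstract does not spell out the protocol, so the following is only an illustrative sketch and not the authors' method. It assumes horizontally partitioned data (each node holds different rows over the same attributes) and shows one reason why sharing limited information can still reproduce the centralized PCA result: the covariance matrix that PCA decomposes can be rebuilt exactly from per-node summaries. All function names (`local_summary`, `merge_summaries`, and so on) and the synthetic data are hypothetical, and the sketch omits any cryptographic protection of the exchanged summaries.

```python
import random

def local_summary(rows):
    """Per-node summary: row count, column sums, and sum of outer products."""
    d = len(rows[0])
    sums = [0.0] * d
    outer = [[0.0] * d for _ in range(d)]
    for x in rows:
        for i in range(d):
            sums[i] += x[i]
            for j in range(d):
                outer[i][j] += x[i] * x[j]
    return len(rows), sums, outer

def merge_summaries(summaries):
    """Add the per-node summaries element-wise (raw rows are never needed)."""
    d = len(summaries[0][1])
    n, sums = 0, [0.0] * d
    outer = [[0.0] * d for _ in range(d)]
    for n_k, s_k, o_k in summaries:
        n += n_k
        for i in range(d):
            sums[i] += s_k[i]
            for j in range(d):
                outer[i][j] += o_k[i][j]
    return n, sums, outer

def covariance_from_summary(n, sums, outer):
    """Sample covariance: (sum of x x^T - n * mean mean^T) / (n - 1)."""
    d = len(sums)
    mean = [s / n for s in sums]
    return [[(outer[i][j] - n * mean[i] * mean[j]) / (n - 1)
             for j in range(d)] for i in range(d)]

def centralized_covariance(rows):
    """Reference result: pool every row in one place, then use the definition."""
    n, d = len(rows), len(rows[0])
    mean = [sum(x[i] for x in rows) / n for i in range(d)]
    return [[sum((x[i] - mean[i]) * (x[j] - mean[j]) for x in rows) / (n - 1)
             for j in range(d)] for i in range(d)]

def top_component(cov, iters=500):
    """First principal component by power iteration on the covariance matrix."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]
    return v

def make_rows(rng, count):
    """Synthetic 3-attribute rows; the second attribute tracks the first."""
    rows = []
    for _ in range(count):
        a = rng.gauss(0.0, 1.0)
        rows.append([a, 2.0 * a + rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.5)])
    return rows

rng = random.Random(0)
nodes = [make_rows(rng, 40), make_rows(rng, 25), make_rows(rng, 60)]  # three nodes

n, sums, outer = merge_summaries([local_summary(rows) for rows in nodes])
cov_distributed = covariance_from_summary(n, sums, outer)
cov_central = centralized_covariance([x for rows in nodes for x in rows])

max_gap = max(abs(cov_distributed[i][j] - cov_central[i][j])
              for i in range(3) for j in range(3))
print("largest covariance difference:", max_gap)
print("first principal component:", top_component(cov_distributed))
```

In this sketch each node sends a summary whose size depends only on the number of attributes, not on the number of rows it holds, which is one way the communication cost of pooling raw data could be avoided. The summaries themselves still reveal aggregate statistics, so a real protocol would need an additional protection step that this example does not attempt to model.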

Keywords

Project Information

Supervising institution of the research project: National Research Foundation of Korea


Cited by

  1. Secure principal component analysis in multiple distributed nodes, vol.9, pp.14, 2016, https://doi.org/10.1002/sec.1501