An Improvement in K-NN Graph Construction Using Regrouping with Locality Sensitive Hashing on MapReduce

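  • 이인희 (Electrical and Computer Engineering, Seoul National University) ;
  • 오혜성 (Electrical and Computer Engineering, Seoul National University) ;
  • 김형주 (Electrical and Computer Engineering, Seoul National University)
  • Received : 2015.02.06
  • Reviewed : 2015.09.14
  • Published : 2015.11.15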

Abstract

A k-nearest neighbor (k-NN) graph is a data structure that records the k nearest neighbors of every node. It is used in collaborative filtering, similarity search, and many other information retrieval and recommendation systems. Despite this usefulness, brute-force k-NN graph construction has a time complexity of $O(n^2)$, which is prohibitive for large data sets. For this reason, algorithms are being studied that build the k-NN graph with Locality Sensitive Hashing (LSH), a technique that is efficient for high-dimensional, sparse data, on MapReduce, a (key, value)-based distributed framework. We follow a two-stage strategy on MapReduce: LSH first divides users into groups of neighbor candidates, and similarity is then computed by brute force only for pairs within each candidate group. Because the similarity computation takes the most time during graph construction, how the candidate groups are formed is important. Existing methods are limited in their ability to prevent large candidate groups. In this paper, we propose an algorithm that regroups large candidate groups for efficient approximate k-NN graph construction. Experimental results show that the proposed approach is more effective than existing methods in terms of graph accuracy and scan rate.
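
The two-stage strategy in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: MinHash is used as the LSH family, Jaccard similarity is the similarity measure, oversized groups are split by one extra hash value (the `regroup` policy, `max_size`, and the seed values are hypothetical), and scan rate is taken to mean the fraction of all user pairs whose similarity is actually computed. In a MapReduce job, stage 1 would roughly correspond to a map step that emits (signature, user) pairs and stage 2 to a reduce step over each group.

```python
import hashlib
import heapq
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def minhash(items, seed):
    """One MinHash value: the smallest seeded hash over the items."""
    return min(hashlib.md5(f"{seed}:{x}".encode()).hexdigest() for x in items)


def group_by_signature(users, data, seeds):
    """Stage 1: bucket users whose MinHash values agree for every seed."""
    groups = defaultdict(list)
    for u in users:
        key = tuple(minhash(data[u], s) for s in seeds)
        groups[key].append(u)
    return list(groups.values())


def regroup(group, data, max_size, seed):
    """Split an oversized group with one extra hash (hypothetical policy)."""
    if len(group) <= max_size:
        return [group]
    parts = group_by_signature(group, data, [seed])
    if len(parts) == 1:  # the extra hash cannot separate the group; stop
        return parts
    result = []
    for part in parts:
        result.extend(regroup(part, data, max_size, seed + 1))
    return result


def knn_graph(data, k, seeds=(0, 1), max_size=50, extra_seed=1000):
    """Stage 2: brute-force similarity inside each (regrouped) group."""
    sims = defaultdict(dict)  # user -> {neighbor: similarity}
    evaluations = 0
    for group in group_by_signature(list(data), data, seeds):
        for sub in regroup(group, data, max_size, extra_seed):
            for u, v in combinations(sub, 2):
                s = jaccard(data[u], data[v])
                evaluations += 1
                sims[u][v] = s
                sims[v][u] = s
    graph = {
        u: heapq.nlargest(k, sims[u].items(), key=lambda kv: kv[1])
        for u in data
    }
    n = len(data)
    # Assumed definition: compared pairs divided by all possible pairs.
    scan_rate = evaluations / (n * (n - 1) / 2) if n > 1 else 0.0
    return graph, scan_rate


if __name__ == "__main__":
    toy = {"a": {1, 2, 3}, "b": {1, 2, 3}, "c": {7, 8, 9}}
    graph, scan_rate = knn_graph(toy, k=1)
    print(graph, scan_rate)
```

Because each user falls into exactly one group here, no pair is compared twice. A lower `max_size` reduces the number of similarity evaluations at the possible cost of missing true neighbors, which is the trade-off between scan rate and graph accuracy that the abstract refers to.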
