DOI QR코드

DOI QR Code

대용량 데이터베이스에서 다차원 인덱스를 사용한 효율적인 다단계 k-NN 검색

Efficient Multi-Step k-NN Search Methods Using Multidimensional Indexes in Large Databases

  • 이상훈 (강원대학교 컴퓨터과학과) ;
  • 김범수 (강원대학교 컴퓨터과학과) ;
  • 최미정 (강원대학교 컴퓨터과학과) ;
  • 문양세 (강원대학교 컴퓨터과학과)
  • 투고 : 2014.09.12
  • 심사 : 2014.12.18
  • 발행 : 2015.02.15

초록

본 논문에서는 다차원 인덱스 기반 다단계 k-NN 검색의 성능 향상 문제를 다룬다. 기존 다단계 k-NN 검색에서는 고차원 객체의 저차원 변환으로 인한 정보 손실로 k-NN 질의 결과 매우 큰 허용치(검색 범위)가 결정되어 범위 질의 결과로 많은 후보가 검색된다. 또한, 많은 후보는 후처리 과정에서 매우 많은 I/O 및 CPU 오버헤드를 발생시킨다. 본 논문에서는 이와 같은 고찰에 기반하여 범위 질의의 허용치를 줄여 후보 개수를 줄이고 이를 통해 성능을 향상시키는 방법을 제안한다. 먼저, k-NN 질의 결과로 결정된 허용치를 고차원 및 저차원 객체간 거리 비율로 강제 축소하여 범위 질의에 사용하는 허용치 축소 (근사적) 해결책을 제안한다. 다음으로, k-NN 질의 계수 k 대신 c k 를 사용하여 얻은 보다 타이트(tight)한 허용치로 범위 질의를 수행하는 계수 제어 (정확한) 해결책을 제안한다. 실제 객체 데이터를 사용하여 실험한 결과, 제안한 두 가지 해결책은 기존 다단계 k-NN 검색에 비해 후보 개수와 검색 시간 모두를 크게 향상시킨 것으로 나타났다.

In this paper, we address the problem of improving the performance of multi-step k-NN search using multi-dimensional indexes. Due to information loss by lower-dimensional transformations, existing multi-step k-NN search solutions produce a large tolerance (i.e., a large search range), and thus, incur a large number of candidates, which are retrieved by a range query. Those many candidates lead to overwhelming I/O and CPU overheads in the postprocessing step. To overcome this problem, we propose two efficient solutions that improve the search performance by reducing the tolerance of a range query, and accordingly, reducing the number of candidates. First, we propose a tolerance reduction-based (approximate) solution that forcibly decreases the tolerance, which is determined by a k-NN query on the index, by the average ratio of high- and low-dimensional distances. Second, we propose a coefficient control-based (exact) solution that uses c k instead of k in a k-NN query to obtain a tigher tolerance and performs a range query using this tigher tolerance. Experimental results show that the proposed solutions significantly reduce the number of candidates, and accordingly, improve the search performance in comparison with the existing multi-step k-NN solution.

키워드

과제정보

연구 과제 주관 기관 : 한국연구재단

참고문헌

  1. F. Korn, N. Sidiropoulos, C. Faloutsos, E. Siegel, and Z. Protopapas, "Fast Nearest Neighbor Search in Medical Image Databases," Proc. of the 22nd Int'l Conference on Very Large Data Bases, Bombay, India, pp. 215-226, Sept. 1996.
  2. R. Agrawal, C. Faloutsos, and A. Swami, "Efficient Similarity Search in Sequence Databases," Proc. of the 4th Int'l Conf. on Foundations of Data Organization and Algorithms, Chicago, Illinois, pp. 69-84, Oct. 1993.
  3. Course of dimensionality. Encyclopedia of Machine Learning, pp. 257-258, Springer, 2010.
  4. Y.-S. Moon, K.-Y. Whang, and W.-S. Han, "General Match: A Subsequence Matching Method in Time-Series Databases Based on Generalized Windows," Proc. of Int'l Conf. on Management of Data, ACM SIGMOD, Madison, Wisconsin, pp. 382-393, Jun. 2002.
  5. Y.-S. Moon, K.-Y. Whang, and W.-K. Loh, "Duality-Based Subsequence Matching in Time-Series Databases," Proc. of the 17th Int'l Conf. on Data Engineering, IEEE ICDE, Heidelberg, Germany, pp. 263-272, Apr. 2001.
  6. C. Faloutsos, M. Ranganathan, and Y. Manolopoulos, "Fast Subsequence Matching in Time-Series Databases," Proc. of Int'l Conf. on Management of Data, ACM SIGMOD, Minneapolis, Minnesota, pp. 419-429, May. 1994.
  7. Y. Tao, K. Yi, C. Sheng, and P. Kalnis, "Quality and Efficiency in High Dimensional Nearest Neighbor Search," Proc. of Int'l Conf. on Management of Data, ACM SIGMOD, Providence, Rhode Island, USA, pp. 563-575, Jun./Jul. 2009.
  8. T. Seidl and H. P. Kriegel, "Optimal Multi-Step k-Nearest Neighbor Search," Proc. of Int'l Conf. on Management of Data, ACM SIGMOD, Seattle, Washington, pp. 154-165, Jun. 1998.
  9. T. Seidl, Adaptable Similarity Search in 3-D Spatial Database Systems, Herbert Utz Verlag, 1998.
  10. S. C. Chapra, Numerical Methods for Engineers, 6th Ed., McGraw-Hill Science, 2010.
  11. B. G. Samuel and J. S. Neil, Using SPSS for Windows and Macintosh: Analyzing and Understanding Data, 6th Ed., Pearson College Div, 2010.
  12. B.-S. Kim, Y.-S. Moon, and J. Kim, "Noise Control Boundary Image Matching Using Time-Series Moving Average Transform," Journal of KIISE: Database, Vol. 36, No. 4, pp. 327-340, Aug. 2009. (in Korean)
  13. Y.-S. Moon, B.-S. Kim, M. S. Kim, and K.-Y. Whang, "Scaling-Invariant Boundary Image Matching Using Time-Series Matching Techniques," Data & Knowledge Engineering, Vol. 69, No. 10, pp. 1022-1042, Oct. 2010. https://doi.org/10.1016/j.datak.2010.07.001
  14. National climatic data center, [Online]. Available: http://www.ncdc.noaa.gov. (downloaded 2013 Mar. 9)
  15. N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, "The R*-tree: An Efficient and Robust Access Method for Points and Rectangles," Proc. of Int'l Conf. on Management of Data, ACM SIGMOD, Atlantic City, New Jersey, pp. 322-331, May 1990.
  16. W.-S. Han, J. Lee, Y.-S. Moon, and H. Jiang, "Ranked Subsequence Matching in Time-Series Databases," Proc. of the 33rd Int'l Conf. on Very Large Data Bases, Vienna, Austria, pp. 423-434, Sept. 2007.
  17. G. Roh, J. Roh, S. Hwang, and B. Yi, "Supporting Pattern Matching Queries over Trajectories on Road Networks," IEEE Trans. on Knowledge and Data Engineering, Vol. 23, No. 11, pp. 1753-1758, Nov. 2011. https://doi.org/10.1109/TKDE.2010.189
  18. Y. Zhu and D. Shasha, "Warping Indexes with Envelope Transforms for Query by Humming," Proc. of Int'l Conf. on Management of Data, ACM SIGMOD, San Diego, California, pp. 181-192, Jun. 2003.
  19. K.-P. Chan, A. W.-C. Fu, and C. T. Yu, "Harr Wavelets for Efficient Similarity Search of Time-Series: With and Without Time Warping," IEEE Trans. on Knowledge and Data Engineering, Vol. 15, No. 3, pp. 686-705, Jan./Feb. 2003. https://doi.org/10.1109/TKDE.2003.1198399
  20. Y.-S. Moon and J. Kim, "Efficient Moving Average Transform-Based Subsequence Matching Algorithms in Time-Series Databases," Information Sciences, Vol. 177, No. 23, pp. 5415-5431, Dec. 2007. https://doi.org/10.1016/j.ins.2007.05.038