An Effective Similarity Search Technique Supporting Time Warping in Sequence Databases

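  • Sang-Wook Kim (Division of Computer, Information, and Communications Engineering, Kangwon National University)
  • Sang-Hyun Park (Researcher, IBM T. J. Watson Research Center)
  • Published: 2001.12.01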

Abstract

This paper discusses how to process similarity search that supports time warping effectively in large sequence databases. Time warping makes it possible to find sequences with similar patterns even when the sequences differ in length. Because the time warping distance does not satisfy the triangle inequality, previous methods could not use multi-dimensional indexes without risking false dismissal. They must therefore scan the entire database and suffer serious performance degradation on large databases. Another method, which uses a suffix tree, also performs poorly because of the large tree size. In this paper, we propose a novel similarity search method that supports time warping. Our primary goal is to achieve good search performance on large databases without false dismissal. To attain this goal, we devise a new distance function $D_{tw-lb}$ that consistently underestimates (lower-bounds) the time warping distance and also satisfies the triangle inequality. $D_{tw-lb}$ uses a 4-tuple feature vector that is extracted from each sequence and is invariant to time warping. For efficient processing, we employ a multi-dimensional index that uses the 4-tuple feature vector as its indexing attributes and $D_{tw-lb}$ as its distance function. We prove that our method does not incur false dismissal. To verify the superiority of our method, we perform extensive experiments. The results reveal that, compared with previous methods, our method achieves speedups of up to 43 times on real-world S&P 500 stock data and up to 720 times on very large synthetic data.
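The filter-and-refine idea described in the abstract can be illustrated with a minimal Python sketch. Several details here are assumptions, because the abstract does not state them: the time warping distance is taken to use an L-infinity base distance (the cost of a warping path is its largest element gap), and the 4-tuple feature vector is taken to be the first, last, greatest, and smallest elements of a sequence. The function names are hypothetical, and a linear scan over feature vectors stands in for the multi-dimensional index.

```python
from typing import List, Sequence, Tuple

INF = float("inf")


def time_warping_distance(s: Sequence[float], q: Sequence[float]) -> float:
    """Time warping distance by dynamic programming.

    Assumption: L-infinity base distance, so a warping path costs as much
    as the largest |s[i] - q[j]| it contains.
    """
    n, m = len(s), len(q)
    if n == 0 or m == 0:
        return 0.0 if n == m else INF
    prev = [INF] * (m + 1)
    prev[0] = 0.0  # two empty prefixes are at distance 0
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        for j in range(1, m + 1):
            best = min(prev[j], cur[j - 1], prev[j - 1])
            cur[j] = max(abs(s[i - 1] - q[j - 1]), best)
        prev = cur
    return prev[m]


def feature_vector(s: Sequence[float]) -> Tuple[float, float, float, float]:
    """Assumed 4-tuple: (first, last, greatest, smallest).

    Repeating elements, which is all that warping does, changes none of these.
    """
    return (s[0], s[-1], max(s), min(s))


def lower_bound_distance(s: Sequence[float], q: Sequence[float]) -> float:
    """L-infinity distance between feature vectors.

    Under the assumptions above it never exceeds the time warping distance,
    and as an L-infinity metric it satisfies the triangle inequality.
    """
    return max(abs(a - b) for a, b in zip(feature_vector(s), feature_vector(q)))


def range_search(database: List[Sequence[float]],
                 query: Sequence[float],
                 eps: float) -> List[Sequence[float]]:
    """Return all sequences within time warping distance eps of the query."""
    # Filtering: discard sequences whose lower bound already exceeds eps.
    # (A real system would answer this step with a multi-dimensional index.)
    candidates = [s for s in database if lower_bound_distance(s, query) <= eps]
    # Refinement: remove false alarms with the exact distance.
    return [s for s in candidates if time_warping_distance(s, query) <= eps]


if __name__ == "__main__":
    db = [
        [20, 21, 21, 20, 20, 23, 23, 23],
        [20, 22, 24, 23],
        [10, 12, 15],
        [20, 25, 20, 23],
    ]
    query = [20, 21, 20, 23]
    for seq in range_search(db, query, eps=2.0):
        print(seq)
```

In this toy data set, the third sequence is pruned by the lower bound alone, while the fourth passes the filter but is rejected in the refinement step (a false alarm). Because the lower bound never exceeds the exact distance, no qualifying sequence is pruned, which is the no-false-dismissal property the abstract refers to.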
