The Performance Bottleneck of Subsequence Matching in Time-Series Databases: Observation, Solution, and Performance Evaluation

  • Sang-Wook Kim (Division of Information and Communications, College of Information and Communications, Hanyang University)
  • Published: 2003.08.01

Abstract

Subsequence matching is an operation that finds, in a time-series database, the subsequences whose changing patterns are similar to a given query sequence. This paper identifies the performance bottleneck in subsequence matching and proposes an effective method that significantly improves overall subsequence matching performance by resolving it. First, through preliminary experiments, we analyze the disk access and CPU processing times required during the index searching and post-processing steps. Based on the results, we show that the post-processing step is the main performance bottleneck in subsequence matching, and then claim that its optimization is a crucial issue overlooked by previous approaches. To resolve the bottleneck, we propose a simple but quite effective method that performs the post-processing step in the optimal way. By rearranging the order in which candidate subsequences are compared with the query sequence, our method completely eliminates the redundant disk accesses and CPU processing that occur in the post-processing step. We formally prove that our method is optimal and does not incur any false dismissal. We show the effectiveness of our method through extensive experiments. The results show that our method speeds up the post-processing step by 3.91 to 9.42 times on a data set of real-world stock sequences and by 4.97 to 5.61 times on large synthetic data sets. The results also show that our method reduces the share of the post-processing step in overall subsequence matching time from about 90% to less than 70%. This implies that our method successfully resolves the performance bottleneck in subsequence matching. As a result, our method provides excellent performance in overall subsequence matching: compared with the previous method, it is 3.05 to 5.60 times faster on the real-world stock data set and 3.68 to 4.21 times faster on the large synthetic data sets.
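To make the idea of reordering candidates concrete, the following is a minimal Python sketch, not the paper's algorithm. The abstract does not specify the ordering used, so the sketch assumes that candidates arrive as hypothetical `(seq_id, offset)` pairs from the index search, that grouping them by data sequence is the rearrangement, and that similarity is Euclidean distance within a tolerance `eps`. The names `post_process` and `read_sequence` are illustrative only.

```python
from collections import defaultdict
from math import sqrt


def post_process(candidates, read_sequence, query, eps):
    """Verify candidate subsequences, reading each data sequence at most once.

    candidates    : iterable of (seq_id, offset) pairs (assumed index output)
    read_sequence : callable seq_id -> list of floats (stands in for a disk read)
    query         : list of floats
    eps           : distance tolerance (Euclidean distance is an assumption)
    """
    # Group candidates by data sequence; the set removes duplicate candidates,
    # so no subsequence is compared with the query more than once.
    by_seq = defaultdict(set)
    for seq_id, offset in candidates:
        by_seq[seq_id].add(offset)

    matches = []
    for seq_id in sorted(by_seq):
        data = read_sequence(seq_id)  # one simulated disk access per sequence
        for offset in sorted(by_seq[seq_id]):
            window = data[offset:offset + len(query)]
            if len(window) != len(query):
                continue  # candidate runs past the end of the sequence
            distance = sqrt(sum((a - b) ** 2 for a, b in zip(window, query)))
            if distance <= eps:
                matches.append((seq_id, offset))
    return matches
```

Processing candidates in arrival order could fetch the same data sequence several times and verify the same candidate repeatedly. Grouping first removes both kinds of repetition in this toy setting, and because every distinct candidate is still checked, no true match is dismissed.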
