Generalization of Window Construction for Subsequence Matching in Time-Series Databases


  • Moon, Yang-Sae (Dept. of Electronic Computer Science, Korea Advanced Institute of Science and Technology) ;
  • Han, Wook-Shin (Dept. of Electronic Computer Science, Korea Advanced Institute of Science and Technology) ;
  • Whang, Kyu-Young (Dept. of Electronic Computer Science, Korea Advanced Institute of Science and Technology)
  • Published : 2001.09.01

Abstract

In this paper, we present the concept of generalization in constructing windows for subsequence matching and propose a new subsequence matching method, GeneralMatch, based on this generalization. The earlier work of Faloutsos et al. (FRM for short) causes many false alarms because it lacks the point-filtering effect. DualMatch, previously proposed by the authors, improves performance significantly over FRM by exploiting the point-filtering effect, but, given the minimum query length, its maximum window size is smaller (half that of FRM). GeneralMatch, an improvement of DualMatch, offers the advantages of both methods: it can use large windows like FRM and, at the same time, can exploit the point-filtering effect like DualMatch. GeneralMatch divides data sequences into J-sliding windows (generalized sliding windows) and the query sequence into J-disjoint windows (generalized disjoint windows). We formally prove that GeneralMatch is correct, i.e., that it incurs no false dismissal. We also prove that, given the minimum query length, there is an upper bound on the window size that guarantees the correctness of GeneralMatch. We then propose a method of determining the value of J that minimizes the number of page accesses. Experimental results for real stock data show that, for low selectivities ($10^{-6} \sim 10^{-4}$), GeneralMatch improves performance by 114% over DualMatch and by 998% over FRM on average; for high selectivities ($10^{-6} \sim 10^{-4}$), by 46% over DualMatch and by 65% over FRM on average.
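The abstract does not spell out how J-sliding and J-disjoint windows are built, so the following Python sketch is only one plausible reading, not the authors' definition. It assumes that J-sliding windows of size w start at every J-th position of a data sequence, and that J-disjoint windows consist of J families of non-overlapping windows of the query, one family per starting offset 0..J-1. The function names `j_sliding_windows` and `j_disjoint_windows` and the 0-based indexing are hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

Window = Tuple[int, Sequence[float]]  # (start position, window contents)


def j_sliding_windows(data: Sequence[float], w: int, j: int) -> List[Window]:
    """Assumed reading: windows of size w starting at every j-th position."""
    if not 1 <= j <= w:
        raise ValueError("expected 1 <= j <= w")
    return [(s, data[s:s + w]) for s in range(0, len(data) - w + 1, j)]


def j_disjoint_windows(query: Sequence[float], w: int, j: int) -> List[Window]:
    """Assumed reading: j families of disjoint windows of size w.

    Family i (0 <= i < j) starts at offset i and then advances by w,
    so windows within one family never overlap.
    """
    if not 1 <= j <= w:
        raise ValueError("expected 1 <= j <= w")
    windows: List[Window] = []
    for i in range(j):
        for s in range(i, len(query) - w + 1, w):
            windows.append((s, query[s:s + w]))
    return windows


if __name__ == "__main__":
    data = list(range(10))
    query = list(range(9))
    print([s for s, _ in j_sliding_windows(data, w=4, j=2)])
    print([s for s, _ in j_disjoint_windows(query, w=4, j=2)])
```

Under these assumptions the two extremes behave as the abstract suggests: with J = 1 the data sequence gets a window at every position and the query gets a single family of disjoint windows (the FRM-style arrangement), while with J = w the data windows become disjoint and the query windows cover every position (the DualMatch-style arrangement).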


References

  1. Agrawal, R., Faloutsos, C., and Swami, A., 'Efficient Similarity Search in Sequence Databases,' In Proc. the 4th Int'l Conf. on Foundations of Data Organization and Algorithms, Chicago, Illinois, pp. 69-84, Oct. 1993
  2. Faloutsos, C., Ranganathan, M., and Manolopoulos, Y., 'Fast Subsequence Matching in Time-Series Databases,' In Proc. Int'l Conf. on Management of Data, ACM SIGMOD, Minneapolis, Minnesota, pp. 419-429, May 1994 https://doi.org/10.1145/191839.191925
  3. Chan, K.-P. and Fu, A. W.-C., 'Efficient Time Series Matching by Wavelets,' In Proc. the 15th Int'l Conf. on Data Engineering (ICDE), IEEE, Sydney, Australia, pp. 126-133, Feb. 1999 https://doi.org/10.1109/ICDE.1999.754915
  4. Moon, Y.-S., Whang, K.-Y., and Loh, W.-K., 'Duality-Based Subsequence Matching in Time-Series Databases,' In Proc. the 17th Int'l Conf. on Data Engineering (ICDE), IEEE, Heidelberg, Germany, pp. 263-272, April 2001 https://doi.org/10.1109/ICDE.2001.914837
  5. Chu, K. W. and Wong, M. H., 'Fast Time-Series Searching with Scaling and Shifting,' In Proc. the 18th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Philadelphia, Pennsylvania, pp. 237-248, June 1999 https://doi.org/10.1145/303976.304000
  6. Yi, B.-K., Jagadish, H. V., and Faloutsos, C., 'Efficient Retrieval of Similar Time Sequences Under Time Warping,' In Proc. the 14th Int'l Conf. on Data Engineering (ICDE), IEEE, Orlando, Florida, pp. 201-208, Feb. 1998 https://doi.org/10.1109/ICDE.1998.655778
  7. Agrawal, R., Lin, K.-I., Sawhney, H. S., and Shim, K., 'Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases,' In Proc. the 21st Int'l Conf. on Very Large Data Bases, Zurich, Switzerland, pp. 490-501, Sept. 1995
  8. Jagadish, H. V., Mendelzon, A. O., and Milo, T., 'Similarity-Based Queries,' In Proc. the 14th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, San Jose, California, pp. 36-45, May 1995
  9. Park, S., Chu, W. W., Yoon, J., and Hsu, C., 'Efficient Searches for Similar Subsequences of Different Lengths in Sequence Databases,' In Proc. the 16th Int'l Conf. on Data Engineering (ICDE), IEEE, San Diego, California, pp. 23-32, March 2000 https://doi.org/10.1109/ICDE.2000.839384
  10. Rafiei, D. and Mendelzon, A., 'Similarity-Based Queries for Time Series Data,' In Proc. Int'l Conf. on Management of Data, ACM SIGMOD, Tucson, Arizona, pp. 13-25, May 1997 https://doi.org/10.1145/253262.253264
  11. Rafiei, D., 'On Similarity-Based Queries for Time Series Data,' In Proc. the 15th Int'l Conf. on Data Engineering (ICDE), IEEE, Sydney, Australia, pp. 410-417, Feb. 1999 https://doi.org/10.1109/ICDE.1999.754957
  12. Beckmann, N., Kriegel, H.-P., Schneider, R., and Seeger, B., 'The R*-tree: An Efficient and Robust Access Method for Points and Rectangles,' In Proc. Int'l Conf. on Management of Data, ACM SIGMOD, Atlantic City, New Jersey, pp. 322-331, May 1990 https://doi.org/10.1145/93597.98741
  13. Berchtold, S., Böhm, C., and Kriegel, H.-P., 'The Pyramid-Technique: Towards Breaking the Curse of Dimensionality,' In Proc. Int'l Conf. on Management of Data, ACM SIGMOD, Seattle, Washington, pp. 142-153, June 1998 https://doi.org/10.1145/276305.276318
  14. Guttman, A., 'R-trees: A Dynamic Index Structure for Spatial Searching,' In Proc. Int'l Conf. on Management of Data, ACM SIGMOD, Boston, Massachusetts, pp. 47-57, June 1984 https://doi.org/10.1145/602259.602266
  15. Seeger, B. and Kriegel, H.-P., 'The Buddy-Tree: An Efficient and Robust Access Method for Spatial Data Base Systems,' In Proc. the 16th Int'l Conf. on Very Large Data Bases, Brisbane, Queensland, Australia, pp. 590-601, Aug. 1990
  16. Whang, K.-Y. and Krishnamurthy, R., 'Multilevel Grid Files,' IBM Research Report RC11516, IBM Thomas J. Watson Research Center, Yorktown Heights, New York, Nov. 1985
  17. Whang, K.-Y., Kim, S.-W., and Wiederhold, G., 'Dynamic Maintenance of Data Distribution for Selectivity Estimation,' The VLDB Journal, Vol. 3, No. 1, pp. 29-51, Jan. 1994 https://doi.org/10.1007/BF01231357
  18. Faloutsos, C. and Kamel, I., 'Beyond Uniformity and Independence: Analysis of R-trees Using the Concept of Fractal Dimension,' In Proc. the 13th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Minneapolis, Minnesota, pp. 4-13, May 1994 https://doi.org/10.1145/182591.182593
  19. Rafiei, D. and Mendelzon, A., 'Efficient Retrieval of Similar Time Sequences Using DFT,' In Proc. Int'l Conf. on Foundations of Data Organization, Kobe, Japan, pp. 249-257, Nov. 1998