A Suffix Tree Transform Technique for Substring Selectivity Estimation

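  • Hongrae Lee (이홍래; Department of Computer Engineering, Seoul National University) ;
  • Kyuseok Shim (심규석; School of Electrical Engineering and Computer Science, Seoul National University) ;
  • Hyoung-Joo Kim (김형주; School of Computer Science and Engineering, Seoul National University)
  • Published: 2007.04.15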

Abstract

Selectivity estimation is a crucial component of query optimization in relational databases. While extensive research has been done on this topic for predicates over numerical data, little work has been done for substring predicates. We propose novel suffix tree transform algorithms for this problem. Unlike previous approaches, in which a full suffix tree is pruned and an estimation algorithm is then applied, we systematically transform a suffix tree into a suffix graph. In our approach, nodes with similar counts are merged while the structural information in the original suffix tree is preserved in a controlled manner. We present both an error-bound algorithm and a space-bound algorithm. Experimental results on real-life data sets show that our algorithms achieve a lower average relative error than previous approaches, as well as good error distribution characteristics.

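Selectivity estimation is an important component of query optimization in relational databases. While this topic has been studied extensively for predicates on numeric data, predicates on substrings have only recently become a focus of attention. In this paper, we present new suffix tree transform algorithms for this problem. Rather than simply pruning nodes from the suffix tree, the proposed technique reduces the overall size by merging nodes with similar counts while preserving structural information. We present algorithms that transform a suffix tree to reduce its size under various constraints, and we show through experiments on real-life data that the proposed algorithms achieve better average relative error and error distribution characteristics than existing algorithms.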
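As background for the abstract, the following is a minimal sketch of the starting point it describes: a full, unpruned count suffix structure from which substring selectivity can be read directly. It uses an uncompressed suffix trie for brevity rather than a compact suffix tree, and it does not implement the paper's node merging, suffix graph, or error-bound and space-bound algorithms. The names `Node`, `build_count_trie`, and `selectivity` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Number of distinct input strings that contain the substring
    # spelled by the path from the root to this node.
    count: int = 0
    children: dict = field(default_factory=dict)
    # Id of the last string that passed through this node, so that a
    # string is counted once even if the substring repeats inside it.
    last_seen: int = -1


def build_count_trie(strings):
    """Insert every suffix of every string (illustrative, quadratic space)."""
    root = Node()
    for sid, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node.children.setdefault(ch, Node())
                if node.last_seen != sid:
                    node.last_seen = sid
                    node.count += 1
    return root


def selectivity(root, pattern, total):
    """Fraction of the `total` strings that contain `pattern`."""
    if total == 0:
        return 0.0
    if not pattern:
        return 1.0  # every string contains the empty substring
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return 0.0  # pattern occurs in no string
    return node.count / total
```

For the three strings `database`, `data`, and `base`, the pattern `data` occurs in two of them, so `selectivity(build_count_trie(strings), "data", 3)` evaluates to 2/3. A structure like this is exact but far too large to keep in a query optimizer, which is the motivation for the size-reducing transforms the abstract describes.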
