[KSCI] Korea Science Citation Index Service

Estimation of Substring Selectivity in Biological Sequence Database

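배진욱 (School of Electrical Engineering and Computer Science, Seoul National University)
이석호 (School of Computer Science and Engineering, Seoul National University)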

Publication Information

Journal of KIISE: Databases, v.30, no.2, 2003, pp. 168-175

Abstract

Until now, substring selectivity has been estimated in two steps. The first step builds a count-suffix tree, which holds statistical information about substrings; the second step uses that tree to estimate the selectivity of a substring. However, building a count-suffix tree from biological sequences is practically impossible because the sequences are too long. This paper therefore proposes a novel data structure, the count q-gram tree, which consists of fixed-length substrings. The count q-gram tree retains the exact counts of all substrings of length at most q; it is built in O(N) time, and its size does not depend on N, the total length of all sequences. This paper also presents an estimation technique, k-MO. k-MO can choose the overlap length of the substrings into which a query string is split, and this choice affects both the accuracy of the estimate and the query processing time. Experiments show that k-MO estimates selectivity very accurately.
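
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation. The names `build_qgram_counts` and `estimate_count` are hypothetical. A flat `Counter` stands in for the count q-gram tree. The estimation formula is an assumed Markov-style chain over pieces that overlap by k characters, in the spirit of overlap-based estimators, and may differ from the paper's actual definition of k-MO. For a fixed q, the counting pass makes one sweep over the input, so its cost grows linearly with N, and the number of stored keys is bounded by the alphabet size and q rather than by N.

```python
from collections import Counter


def build_qgram_counts(sequences, q):
    """Count every substring of length 1..q (hypothetical stand-in for the tree)."""
    counts = Counter()
    total_positions = 0  # N: total length of all sequences
    for seq in sequences:
        n = len(seq)
        total_positions += n
        for i in range(n):
            # Substrings starting at i, truncated at the end of the sequence.
            for length in range(1, min(q, n - i) + 1):
                counts[seq[i:i + length]] += 1
    return counts, total_positions


def estimate_count(query, counts, total_positions, q, k):
    """Estimate the occurrence count of `query` (assumed formula, not the paper's).

    Queries of length <= q are answered exactly. Longer queries are split into
    pieces of length at most q that overlap by k characters, and the estimate is
    count(piece_1) * product(count(piece_i) / count(overlap_i)).
    """
    if not 0 <= k < q:
        raise ValueError("k must satisfy 0 <= k < q")
    if len(query) <= q:
        return float(counts[query])
    estimate = float(counts[query[:q]])
    covered = q  # number of leading query characters already accounted for
    while covered < len(query) and estimate > 0.0:
        start = covered - k
        piece = query[start:start + q]
        overlap = query[start:covered]
        # With k = 0 the overlap is empty; its "count" is the number of positions.
        overlap_count = counts[overlap] if overlap else total_positions
        estimate *= counts[piece] / overlap_count
        covered = start + len(piece)
    return estimate


counts, n_positions = build_qgram_counts(["ACGTACGT"], q=3)
print(estimate_count("ACGT", counts, n_positions, q=3, k=2))  # 2.0
print(estimate_count("ACGT", counts, n_positions, q=3, k=0))  # 0.5
```

In this toy input, "ACGT" occurs twice. The estimate with k = 2 recovers that count, while k = 0 treats the pieces as independent and underestimates it. This matches the abstract's remark that the choice of overlap length affects accuracy. A larger k also means more pieces per query and therefore more lookups.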

Keywords

Biological Sequence Database; DNA Sequence; Estimation of Substring Selectivity; Count Suffix Tree; Count Q-gram Tree; k-MO;

Citations & Related Records

Reference

1	L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In Proceedings of VLDB, 2001
2	E. Hunt, M. P. Atkinson, and R. W. Irving. A database index to large biological sequences. In Proceedings of VLDB, 2001
3	R. Grossi and J. S. Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. ACM Symposium on Theory of Computing, pages 397-406, 2000
4	ftp://ncbi.nlm.nih.gov/genomes/H_sapiens/, 2001
5	P. Krishnan, J. S. Vitter, and B. Iyer. Estimating alphanumeric selectivity in the presence of wildcards. In Proceedings of ACM SIGMOD, pages 282-293, 1996
6	H. V. Jagadish, R. T. Ng, and D. Srivastava. Substring selectivity estimation. In Proceedings of ACM Symposium on Principles of Database Systems, June 1999
7	P. J. Haas, J. F. Naughton, S. Seshadri, and L. Stokes. Sampling-based estimation of the number of distinct values of an attribute. In Proceedings of VLDB, pages 311-322, 1995
8	H. V. Jagadish, O. Kapitskaia, R. T. Ng, and D. Srivastava. Multi-dimensional substring selectivity estimation. In Proceedings of VLDB, 1999
9	Z. Chen, H. V. Jagadish, F. Korn, N. Koudas, S. Muthukrishnan, R. T. Ng, and D. Srivastava. Counting twig matches in a tree. In Proceedings of ICDE, pages 595-604, 2001
10	A. Aboulnaga, A. R. Alameldeen, and J. F. Naughton. Estimating the selectivity of XML path expressions for internet scale applications. In Proceedings of VLDB, pages 591-600, 2001
11	V. Poosala, Y. E. Ioannidis, P. J. Haas, and E. J. Shekita. Improved histograms for selectivity estimation of range predicates. In Proceedings of ACM SIGMOD, 1996
12	G. Navarro and R. Baeza-Yates. A practical q-gram index for text retrieval allowing errors. CLEI Electronic Journal, Volume 1, Number 2, 1998
13	E. M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, Volume 23, pages 262-272, 1976
14	H. V. Jagadish, H. T. Ng, and D. Srivastava. On effective multi-dimensional indexing for strings. In Proceedings of ACM SIGMOD, pages 403-414, 2000
15	P. Ferragina, N. Koudas, S. Muthukrishnan, and D. Srivastava. Two-dimensional substring indexing. In Proceedings of ACM Symposium on Principles of Database Systems, 2001
16	G. Navarro, E. Sutinen, J. Tanninen, and J. Tarhio. Indexing text with approximate q-grams. In Proceedings of Combinatorial Pattern Matching, pages 350-363, 2000
17	E. Ukkonen. Approximate string-matching with q-grams and maximal matches. Theoretical Computer Science, Volume 92, pages 192-211, 1992

Original title (Korean): 생물학 서열 데이타베이스에서 부분 문자열의 선택도 추정
