
Estimation of Substring Selectivity in Biological Sequence Database  

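배진욱 (School of Electrical Engineering and Computer Science, Seoul National University)
이석호 (School of Computer Science and Engineering, Seoul National University)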
Abstract
Until now, substring selectivity has been estimated in two steps: first, a count-suffix tree holding statistical information about substrings is built; second, the substring selectivity is estimated from it. However, building a count-suffix tree from biological sequences is practically impossible because the sequences are too long. This paper therefore proposes a novel data structure, the count q-gram tree, which consists of fixed-length substrings. The count q-gram tree retains the exact counts of all substrings of length at most q; it is built in O(N) time, and its size does not depend on N, the total length of all sequences. This paper also presents an estimation technique, k-MO. k-MO can choose the overlap length of the substrings into which a query string is split, and this choice affects both the accuracy of the estimated selectivity and the query processing time. Experiments show that k-MO estimates selectivity very accurately.
Keywords
Biological Sequence Database; DNA Sequence; Estimation of Substring Selectivity; Count Suffix Tree; Count Q-gram Tree; k-MO;
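As a rough illustration of the two ideas in the abstract, the following Python sketch stores the counts of all substrings of length at most q in a flat table and chains overlapping pieces of a query to estimate its occurrence count. It is a sketch under stated assumptions, not the paper's method: the names `build_count_qgram_table` and `estimate_occurrences` are hypothetical, a hash table stands in for the tree, construction here takes O(N·q) time rather than the O(N) stated in the abstract, and the chaining rule is a generic Markov-style overlap estimate that may differ from the actual k-MO definition.

```python
from collections import Counter


def build_count_qgram_table(sequences, q):
    """Count every substring of length 1..q across all sequences.

    Hypothetical stand-in for the count q-gram tree: a flat table, not a tree.
    """
    counts = Counter()
    for seq in sequences:
        # The empty string stores the total length; it serves as the
        # normalizer when two consecutive pieces do not overlap.
        counts[""] += len(seq)
        for i in range(len(seq)):
            for length in range(1, min(q, len(seq) - i) + 1):
                counts[seq[i:i + length]] += 1
    return counts


def estimate_occurrences(counts, q, query, k):
    """Estimate how often `query` occurs, using pieces that overlap by k.

    Assumed rule (not taken from the paper): multiply the count of the
    first piece by count(piece) / count(overlap) for each later piece.
    """
    if not 0 <= k < q:
        raise ValueError("k must satisfy 0 <= k < q")
    if len(query) <= q:
        return float(counts[query])  # short queries have exact counts

    step = q - k
    estimate = float(counts[query[:q]])
    covered = q  # number of leading query characters already accounted for
    while covered < len(query) and estimate > 0:
        end = min(covered + step, len(query))
        piece = query[end - q:end]        # next piece of length q
        overlap = query[end - q:covered]  # part shared with earlier pieces
        if counts[overlap] == 0:
            return 0.0
        estimate *= counts[piece] / counts[overlap]
        covered = end
    return estimate


counts = build_count_qgram_table(["ACGTACGT"], q=3)
print(estimate_occurrences(counts, 3, "ACGTAC", k=2))  # larger overlap
print(estimate_occurrences(counts, 3, "ACGTAC", k=0))  # no overlap
```

In this toy input the query "ACGTAC" occurs once; the larger overlap costs more table lookups but uses more context per piece, which mirrors the trade-off between accuracy and query processing time mentioned in the abstract. For a DNA alphabet of four letters the table holds at most 4 + 4^2 + ... + 4^q non-empty entries, regardless of N.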