A Space Efficient Indexing Technique for DNA Sequences  

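Song, Hye-Ju (Department of Multimedia Science, Sookmyung Women's University)
Park, Young-Ho (Department of Multimedia Science, Sookmyung Women's University)
Loh, Woong-Kee (Division of Multimedia, Sungkyul University)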
Abstract
Suffix trees are widely used for similar sequence matching on DNA. Because DNA sequences are very large and do not fit in main memory, suffix trees built over them suffer from long construction times, heavy disk and memory usage, and data skew. In this paper, we present a space-efficient indexing method called SENoM, which builds the tree without a merging phase for the partitioned subtrees. The index is constructed in two phases. In the first phase, we partition the suffixes of the input string by a common variable-length prefix until the number of suffixes in each partition is smaller than a threshold. In the second phase, we construct a disk-based subtree from each suffix set and write it to disk. SENoM thereby eliminates the complex merging phase. Our experiments show that the proposed method is effective in the following respects. SENoM reduces disk usage less than 35% and memory usage less than 20% compared with the TRELLIS algorithm. It also supports efficient querying through the prefix tree, even when the query sequence is long.
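The first phase described above can be illustrated with a small in-memory sketch. It is not the authors' implementation: the function names `partition_suffixes` and `find_occurrences`, the unique terminal symbol `$`, and the use of plain Python dictionaries instead of disk-resident subtrees are all assumptions made for illustration. The sketch only shows how a prefix grows by one character while too many suffixes share it, and how a query can be routed through the resulting prefixes without merging partitions.

```python
from collections import defaultdict


def partition_suffixes(text, threshold):
    """Group suffix start positions of `text` by a variable-length prefix.

    A prefix is extended by one character as long as the number of suffixes
    sharing it is not smaller than `threshold` (hypothetical parameter).
    Assumes `text` ends with a unique terminal symbol such as '$'.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    partitions = {}                          # prefix -> suffix start positions
    pending = [("", list(range(len(text))))]
    while pending:
        prefix, positions = pending.pop()
        if len(positions) < threshold:
            partitions[prefix] = positions   # small enough: partition is final
            continue
        depth = len(prefix)
        groups = defaultdict(list)
        for pos in positions:
            if pos + depth < len(text):
                groups[text[pos + depth]].append(pos)
            else:
                # The suffix equals the prefix and cannot be extended further.
                partitions.setdefault(prefix, []).append(pos)
        for symbol, group in groups.items():
            pending.append((prefix + symbol, group))
    return partitions


def find_occurrences(text, partitions, query):
    """Return sorted start positions of `query`, visiting only the partitions
    whose prefix is compatible with the query."""
    hits = []
    for prefix, positions in partitions.items():
        if prefix.startswith(query) or query.startswith(prefix):
            hits.extend(p for p in positions if text.startswith(query, p))
    return sorted(hits)


text = "AAAAC$"
parts = partition_suffixes(text, threshold=3)
print(parts)
print(find_occurrences(text, parts, "AA"))   # [0, 1, 2]
```

In this toy input the frequent prefix `A` is extended to `AAA` and `AAC`, while the rare symbols `C` and `$` keep one-character prefixes. In the method described in the abstract, each such partition would then be turned into its own subtree and written to disk; that second phase is not shown here.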
Keywords
Suffix Tree; Variable-length Prefix