[KSCI] Korea Science Citation Index Service

A Practical Approximate Sub-Sequence Search Method for DNA Sequence Databases

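Won, Jung-Im (School of Information and Communications, Hanyang University)
Hong, Sang-Kyoon (Department of Computer Science, Yonsei University)
Yoon, Jee-Hee (Division of Information and Communication Engineering, Hallym University)
Park, Sang-Hyun (Department of Computer Science, Yonsei University)
Kim, Sang-Wook (School of Information and Communications, Hanyang University)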

Publication Information

Journal of KIISE: Databases, v.34, no.2, 2007, pp. 119-132

Abstract

Approximate subsequence search is one of the most important operations in molecular biology. In this paper, we propose an accurate and efficient method for approximate subsequence search in large DNA databases. The proposed method adopts a binary trie as its primary index structure and stores in it all the window subsequences extracted from a DNA sequence. To answer a query, it traverses the binary trie breadth-first and uses dynamic programming to retrieve all the matching subsequences along the traversed paths. Because the trie stores only window subsequences of a predetermined length, long query sequences incur a large post-processing time. To overcome this problem, we divide a long query sequence into shorter pieces, search for each piece separately, and then merge the results. We evaluated the proposed method through a series of experiments. The results show that the proposed method requires less storage space and achieves a 4- to 17-fold performance improvement over the suffix-tree-based method. Even for long query sequences, it is more than an order of magnitude faster than both the suffix-tree-based method and the Smith-Waterman algorithm.
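
The indexing and search steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a plain character trie instead of the paper's binary trie, assumes edit distance as the similarity measure, keeps start offsets at every node for simplicity, and leaves out the query-splitting and merging step. The names `build_window_trie`, `approx_search`, `w`, and `k` are hypothetical.

```python
from collections import deque


def build_window_trie(seq, w):
    """Insert every window seq[i:i+w] into a character trie.

    Each node records the start offsets of the windows passing through it
    (an assumption made for brevity; it is not space-efficient).
    """
    root = {"children": {}, "starts": []}
    for i in range(len(seq)):
        node = root
        for ch in seq[i:i + w]:
            node = node["children"].setdefault(
                ch, {"children": {}, "starts": []}
            )
            node["starts"].append(i)
    return root


def approx_search(root, query, k):
    """Return sorted (start, length, distance) triples for every indexed
    subsequence whose edit distance to `query` is at most k."""
    m = len(query)
    hits = []
    # row[j] = edit distance between query[:j] and the string spelled by
    # the path from the root to the current node.
    queue = deque([(root, list(range(m + 1)), 0)])
    while queue:  # breadth-first traversal of the trie
        node, row, depth = queue.popleft()
        for ch, child in node["children"].items():
            new = [row[0] + 1]
            for j in range(1, m + 1):
                cost = 0 if query[j - 1] == ch else 1
                new.append(min(new[j - 1] + 1,      # skip a query symbol
                               row[j] + 1,          # skip a path symbol
                               row[j - 1] + cost))  # match or substitute
            if new[m] <= k:
                hits.extend((s, depth + 1, new[m]) for s in child["starts"])
            # The row minimum never decreases along a path, so a branch
            # whose minimum exceeds k cannot produce a match further down.
            if min(new) <= k:
                queue.append((child, new, depth + 1))
    return sorted(hits)


trie = build_window_trie("ACGTACGT", w=5)
print(approx_search(trie, "ACGT", k=0))  # [(0, 4, 0), (4, 4, 0)]
```

In this sketch a match can be at most `w` symbols long, so a query of length m with tolerance k is only fully covered when `w` is at least m + k. That limit is the reason the abstract gives for splitting long queries into shorter pieces.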

Keywords

DNA sequence database; approximate subsequence search; indexing; trie; suffix tree

References

1	E. Horowitz, S. Sahni, and S. Anderson-Freed, Fundamentals of Data Structures in C, Computer Science Press, 1993
2	H. Wang et al., 'BLAST++: A Tool for BLASTing Queries in Batches,' In Proceedings First Asia-Pacific Bioinformatics Conference, pp. 71-79, 2003
3	G. Navarro and R. Baeza-Yates, 'A Hybrid Indexing Method for Approximate String Matching,' J. of Discrete Algorithms, Vol. 1, No. 1, pp. 205-239, 2000
4	S. Altschul, T. Madden, A. Schaffer, J. Zhang, W. Miller, and D. Lipman, 'Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs,' Nucleic Acids Research, Vol. 25, No. 17, pp. 3389-3402, 1997
5	G. A. Stephen, String Searching Algorithms, World Scientific Publishing, 1994
6	A. L. Delcher, S. Kasif, R. D. Fleischmann, J. Peterson, O. White, and S. L. Salzberg, 'Alignment of whole genomes,' Nucleic Acids Research, 27, pp. 2369-2376, 1999
7	E. Hunt, M. P. Atkinson, and R. W. Irving, 'Database indexing for large DNA and protein sequence collections,' The VLDB Journal, Vol. 11, No. 3, pp. 256-271, 2002
8	S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman, 'Basic local alignment search tool,' Journal of Molecular Biology, Vol. 215, No. 3, pp. 403-410, 1990
9	http://www.ncbi.nlm.nih.gov
10	H. Shang and T. H. Merrett, 'Tries for approximate string matching,' IEEE Trans. on Knowledge and Data Engineering, Vol. 8, No. 4, pp. 540-547, 1996
11	V. Makinen and G. Navarro, 'Compressed Compact Suffix Arrays,' CPM 2004, Springer-Verlag LNCS 3109, pp. 420-433
12	U. Manber and G. Myers, 'Suffix arrays: A new method for on-line string searches,' SIAM J. Comput. 22, pp. 935-948, 1993
13	S. Tata, R. Hankins, and J. Patel, 'Practical Suffix Tree Construction,' In Proceedings of the 30th VLDB Conference, pp. 36-47, 2004
14	V. Makinen, 'Compact Suffix Array: A Space efficient Full-text Index,' Fundamenta Informaticae, 56(1-2), pp. 191-210, 2003
15	K. Kelly and P. Labute, 'The A* Search and Applications to Sequence Alignment,' http://www.chemcomp.com/article/astar.htm, 1996
16	T. Kahveci and A. K. Singh, 'An Efficient Index Structure for String Databases,' In Proceedings of the 27th VLDB Conference, pp. 351-360, 2001
17	C. Fondrat and P. Dessen, 'A Rapid Access Motif database (RAMdb) with a search algorithm for the retrieval patterns in nucleic acids or protein databanks,' Computer Applications in the Biosciences, Vol. 11, No. 3, pp. 273-279, 1995
18	R. Giegerich, S. Kurtz, and J. Stoye, 'Efficient Implementation of Lazy Suffix Trees,' Softw. Pract. Exp., Vol. 33, pp. 1035-1049, 2003
19	A. Califano and I. Rigoutsos, 'FLASH: A Fast Look-up Algorithm for String Homology,' In Proceedings of Intelligent System Conference for Molecular Biology, pp. 56-64, 1993
20	E. Ukkonen, 'Approximate string matching over suffix trees,' In Proceedings of Combinatorial Pattern Matching (CPM93), pp. 228-242, 1993
21	S. Kurtz, 'Reducing the Space Requirement of Suffix Trees,' Softw. Pract. Exp., Vol. 29, pp. 1149-1171, 1999
22	C. Meek, J. M. Patel, and S. Kasetty, 'OASIS: An Online and Accurate Technique for Local-Alignment Searches on Biological sequences,' In Proceedings of the 29th VLDB Conference, pp. 920-921, 2003
23	K. Sadakane and T. Shibuya, 'Indexing huge genome sequences for solving various problems,' In Proceedings of the 12th Genome Informatics, pp. 175-183, 2001
24	S. Kurtz, J. Choudhuri, E. Ohlebusch, C. Schleiermacher, J. Stoye, and R. Giegerich, 'REPuter: the manifold applications of repeat analysis on a genome scale,' Nucleic Acids Research, Vol. 29, No. 22, pp. 4633-4642, 2001
25	C. Gibas and P. Jambeck, Developing Bioinformatics Computer Skills, O'Reilly and Associates Inc., 2001
26	Z. Tan, X. Cao, B. Ooi, and A. Tung, 'The ed-tree: An Index for Large DNA Sequence Databases,' In Proceedings of SSDBM Conference, pp. 1-10, 2003
27	H. E. Williams and J. Zobel, 'Indexing and Retrieval for Genomic Databases,' IEEE TKDE, Vol. 14, No. 1, pp. 63-78, 2002
28	T. Smith and M. Waterman, 'Identification of Common Molecular Subsequences,' Journal of Molecular Biology, 147, pp. 195-197, 1981
