http://dx.doi.org/10.5302/J.ICROS.2005.11.10.851

A Study on Clustering and Identifying Gene Sequences using Suffix Tree Clustering Method and BLAST  

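Han, Sang-Il (Department of Chemical Engineering, Pusan National University)
Lee, Sung-Gun (Department of Chemical Engineering, Pusan National University)
Kim, Kyung-Hoon (Department of Chemical Engineering, Pusan National University)
Lee, Ju-Yeong (Department of Chemical Engineering, Pusan National University)
Kim, Young-Han (Department of Chemical Engineering, Dong-A University)
Hwang, Kyu-Suk (Department of Chemical Engineering, Pusan National University)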
Publication Information
Journal of Institute of Control, Robotics and Systems / v.11, no.10, 2005, pp. 851-856
Abstract
DNA and protein data from diverse species are discovered daily and deposited in public archives, each in its established format. The database systems of these archives provide not only an easy-to-use, flexible public interface but also in silico tools for analyzing unidentified sequence data. Among these tools, multiple sequence alignment [1] methods, which rely on pairwise alignment and the Smith-Waterman algorithm [2], make it possible to identify unknown DNA or protein sequences and the phylogenetic relationships among several species. However, the runtime of existing multiple alignment methods increases exponentially with the number of sequences. To remedy this problem, we adopted a parallel-processing suffix tree algorithm that searches for common subsequences in a single pass, without pairwise alignment. Among the common subsequences found, cross-matching subsequences that trigger inexact matching may also be produced, so this paper proposes a cross-matching masking process. To identify the function of the clusters generated by suffix tree clustering, BLAST was combined with the clustering tool. Our clustering and annotation tool proceeds in four steps: (1) construction of the suffix tree; (2) masking of cross-matching pairs; (3) clustering of gene sequences; and (4) annotation of gene clusters by BLAST search. The system was successfully evaluated on 22 gene sequences from the pyruvate pathway of bacteria, producing 7 clusters and finding representative common subsequences for each cluster.
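The idea behind steps (1) and (3) can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a naive, depth-limited generalized suffix trie instead of a linear-time suffix tree construction, it groups sequences that share any exact substring of a hypothetical length `min_len`, and it omits the cross-matching masking and BLAST annotation steps entirely. The function names and the toy sequences are illustrative assumptions.

```python
from collections import defaultdict


def build_suffix_trie(seqs, max_depth):
    """Naive generalized suffix trie, truncated at max_depth characters.

    Each node records the indices of the sequences whose suffixes pass
    through it. This is a simple stand-in for a real suffix tree.
    """
    root = {"children": {}, "ids": set()}
    for idx, seq in enumerate(seqs):
        for start in range(len(seq)):
            node = root
            for ch in seq[start:start + max_depth]:
                node = node["children"].setdefault(
                    ch, {"children": {}, "ids": set()}
                )
                node["ids"].add(idx)
    return root


def cluster_sequences(seqs, min_len):
    """Group sequences that share an exact substring of length min_len."""
    root = build_suffix_trie(seqs, max_depth=min_len)
    parent = list(range(len(seqs)))  # union-find forest over sequence indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Every trie node at depth min_len spells one substring of that length;
    # merge all sequences that contain it.
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if depth == min_len:
            ids = sorted(node["ids"])
            for other in ids[1:]:
                parent[find(other)] = find(ids[0])
            continue
        for child in node["children"].values():
            stack.append((child, depth + 1))

    groups = defaultdict(list)
    for i in range(len(seqs)):
        groups[find(i)].append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    toy = ["ATGGCGTACG", "TTGGCGTAAA", "CCCCCAAAAT", "GGAAAATCCC", "TTTTTTTTTT"]
    print(cluster_sequences(toy, min_len=5))
```

Because every sequence is inserted into one shared structure, common substrings are found in a single traversal without comparing the sequences pair by pair, which is the property the abstract relies on.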
Keywords
clustering; suffix tree; gene; BLAST; database
Citations & Related Records
Times Cited By KSCI: 1
1 D. Gusfield, 'Algorithms on strings, trees, and sequences: computer science and computational biology,' Cambridge University Press, London, pp. 116, 1997
2 E. Ukkonen, 'On-line construction of suffix trees,' Algorithmica, vol. 14, pp. 353-364, 1993
3 E. M. McCreight, 'A space-economical suffix tree construction algorithm,' J. ACM, vol. 23, pp. 262-272, 1976
4 S. I. Han, S. G. Lee, B. K. Hou, S. H. Park, Y. H. Kim and K. S. Hwang, 'A gene clustering method with masking cross-matching fragments using modified suffix tree clustering method,' Korean J. Chem. Eng., vol. 22(3), pp. 345, 2005
5 S. Altschul, W. Gish, W. Miller, E. Myers and D. Lipman, 'Basic local alignment search tool,' J. Mol. Biol., vol. 215(3), pp. 403-410, 1990
6 O. Zamir, O. Etzioni, O. Madani and R. M. Karp, 'Fast and intuitive clustering of Web documents,' In Proc. of the 3rd International Conference on Knowledge Discovery and Data Mining, pp. 287-290, 1997
7 A. Kalyanaraman, S. Aluru and S. Kothari, 'Parallel EST clustering,' HICOMB, 185, 2002
8 N. Volfovsky, B. J. Haas and S. L. Salzberg, 'A clustering method for repeat analysis in DNA sequences,' Genome Biol., vol. 2, pp. 1-11, 2001
9 A. L. Delcher, S. Kasif, R. D. Fleischmann, J. Peterson, O. White and S. L. Salzberg, 'Alignment of whole genomes,' Nucleic Acids Res., vol. 27(11), pp. 2369-2376, 1999
10 A. L. Delcher, A. Phillippy, J. Carlton and S. L. Salzberg, 'Fast algorithms for large-scale genome alignment and comparison,' Nucleic Acids Res., vol. 30(11), pp. 2478-2483, 2002
11 D. W. Mount, 'Bioinformatics: sequence and genome analysis,' Cold Spring Harbor Laboratory Press, New York, pp. 3-5, 2001
12 J. M. Ostell, S. J. Wheelan and J. A. Kans, 'The NCBI data model,' Methods Biochem. Anal., vol. 43, pp. 19, 2001
13 T. F. Smith and M. S. Waterman, 'Identification of common molecular subsequences,' J. Mol. Biol., vol. 147, pp. 195-197, 1981
14 J. Y. Chen and J. V. Carlis, 'Genomic data modeling,' Information Systems, vol. 28, pp. 287, 2003
15 C. Notredame and D. G. Higgins, 'SAGA: sequence alignment by genetic algorithm,' Nucleic Acids Res., vol. 24, pp. 1515-1524, 1996