[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.5302/J.ICROS.2005.11.10.851

A Study on Clustering and Identifying Gene Sequences using Suffix Tree Clustering Method and BLAST

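Han, Sang-Il (Department of Chemical Engineering, Pusan National University)
Lee, Sung-Gun (Department of Chemical Engineering, Pusan National University)
Kim, Kyung-Hoon (Department of Chemical Engineering, Pusan National University)
Lee, Ju-Yeong (Department of Chemical Engineering, Pusan National University)
Kim, Young-Han (Department of Chemical Engineering, Dong-A University)
Hwang, Kyu-Suk (Department of Chemical Engineering, Pusan National University)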

Publication Information

Journal of Institute of Control, Robotics and Systems, v.11, no.10, 2005, pp. 851-856

Abstract

DNA and protein data from diverse species are discovered daily and deposited in public archives, each in that archive's established format. The database systems of these archives provide the public not only with an easy-to-use, flexible interface but also with in silico tools for analyzing unidentified sequence data. Among these tools, multiple sequence alignment [1] methods, which rely on pairwise alignment and the Smith-Waterman algorithm [2], make it possible to identify unknown DNA or protein sequences and to infer phylogenetic relationships among species. However, the runtime of existing multiple alignment methods increases exponentially with the number of sequences. To remedy this problem, we adopted a parallel-processing suffix tree algorithm that can search for common subsequences all at once, without pairwise alignment. The common subsequences found in this way may include cross-matching subsequences that trigger inexact matching, so this paper proposes a process for masking them. To identify the function of the clusters generated by suffix tree clustering, BLAST was combined with the clustering tool. Our clustering and annotation tool works in four steps: (1) construction of the suffix tree; (2) masking of cross-matching pairs; (3) clustering of gene sequences; and (4) annotation of gene clusters by BLAST search. The system was successfully evaluated on 22 gene sequences from the pyruvate pathway of bacteria, producing 7 clusters and finding representative common subsequences for each cluster.
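
The idea behind steps (1) and (3) can be illustrated with a minimal Python sketch. It is not the authors' implementation: it builds a naive, depth-bounded generalized suffix trie (quadratic time, not the parallel suffix tree algorithm mentioned above), treats every substring shared by at least two sequences as a base cluster, and merges sequences by single linkage. The cross-matching masking step and the BLAST annotation step are omitted. All names (`Node`, `build_suffix_trie`, `base_clusters`, `cluster_sequences`) and the parameters `min_len` and `max_depth` are hypothetical, chosen only for this illustration.

```python
class Node:
    """One node of a depth-bounded generalized suffix trie (illustrative only)."""

    def __init__(self):
        self.children = {}
        self.seq_ids = set()  # ids of sequences with a suffix passing through this node


def build_suffix_trie(seqs, max_depth):
    """Insert every suffix of every sequence, truncated to max_depth characters."""
    root = Node()
    for sid, seq in enumerate(seqs):
        for start in range(len(seq)):
            node = root
            for ch in seq[start:start + max_depth]:
                node = node.children.setdefault(ch, Node())
                node.seq_ids.add(sid)
    return root


def base_clusters(root, min_len):
    """Yield (substring, sequence ids) for substrings of length >= min_len
    that occur in at least two sequences."""
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if len(prefix) >= min_len and len(node.seq_ids) >= 2:
            yield prefix, frozenset(node.seq_ids)
        for ch, child in node.children.items():
            # A subtree reached by fewer than two sequences holds no shared substring.
            if len(child.seq_ids) >= 2:
                stack.append((child, prefix + ch))


def cluster_sequences(seqs, min_len=5, max_depth=20):
    """Group sequences that share a substring of length >= min_len (single linkage)."""
    root = build_suffix_trie(seqs, max_depth)
    parent = list(range(len(seqs)))  # union-find forest over sequence ids

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    shared = list(base_clusters(root, min_len))
    for _, ids in shared:
        first, *rest = sorted(ids)
        for other in rest:
            parent[find(other)] = find(first)

    clusters = {}
    for sid in range(len(seqs)):
        entry = clusters.setdefault(find(sid), {"members": [], "representative": ""})
        entry["members"].append(sid)

    # Representative: the longest substring shared by at least two members
    # (ties broken alphabetically so the result is deterministic).
    for sub, ids in shared:
        entry = clusters[find(next(iter(ids)))]
        rep = entry["representative"]
        if (len(sub), sub) > (len(rep), rep):
            entry["representative"] = sub
    return sorted(clusters.values(), key=lambda c: c["members"])


if __name__ == "__main__":
    toy = ["ACGTACGTGG", "TTACGTACCA", "GGGCCCAAAT", "CGGGCCCAAT", "TATATATATA"]
    for cluster in cluster_sequences(toy, min_len=5):
        print(cluster["members"], cluster["representative"])
```

In this sketch the toy sequences are made up, and `min_len` controls how long a common subsequence must be before two sequences are grouped. In a fuller pipeline, each cluster's representative subsequence would then be submitted to a BLAST search for annotation, as in step (4).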

Keywords

clustering; suffix tree; gene; BLAST; database;

References

1	D. Gusfield, 'Algorithms on strings, trees, and sequences: computer science and computational biology,' Cambridge University Press, London, pp. 116, 1997
2	E. Ukkonen, 'On-line construction of suffix trees,' Algorithmica 14, pp. 353-364, 1993
3	E. M. McCreight, 'A space-economical suffix tree construction algorithm,' J. ACM 23, pp. 262-272, 1976
4	S. I. Han, S. G. Lee, B. K. Hou, S. H. Park, Y. H. Kim and K. S. Hwang, 'A gene clustering method with masking cross-matching fragments using modified suffix tree clustering method,' Korean J. Chem. Eng., vol. 22(3), pp. 345, 2005
5	S. Altschul, W. Gish, W. Miller, E. Myers and D. Lipman, 'Basic local alignment search tool,' Journal of Molecular Biology, vol. 215, no. 3, pp. 403-410, 1990
6	O. Zamir, O. Etzioni, O. Madani and R. M. Karp, 'Fast and intuitive clustering of Web documents,' In Proc. of the 3rd International Conference on Knowledge Discovery and Data Mining, pp. 287-290, 1997
7	A. Kalyanaraman, S. Aluru and S. Kothari, 'Parallel EST clustering,' HICOMB, 185, 2002
8	N. Volfovsky, B. J. Haas and S. L. Salzberg, 'A clustering method for repeat analysis in DNA sequences,' Genome Biol., vol. 2, pp. 1-11, 2001
9	A. L. Delcher, S. Kasif, R. D. Fleischmann, J. Peterson, O. White and S. L. Salzberg, 'Alignment of whole genomes,' Nucleic Acids Res., vol. 27(11), pp. 2369-2376, 1999
10	A. L. Delcher, A. Phillippy, J. Carlton and S. L. Salzberg, 'Fast algorithms for large-scale genome alignment and comparison,' Nucleic Acids Res., vol. 30(11), pp. 2478-2483, 2002
11	D. W. Mount, 'Bioinformatics: sequence and genome analysis,' Cold Spring Harbor Laboratory Press, New York, pp. 3-5, 2001
12	J. M. Ostell, S. J. Wheelan and J. A. Kans, 'The NCBI data model,' Methods Biochem. Anal., vol. 43, pp. 19, 2001
13	T. F. Smith and M. S. Waterman, 'Identification of common molecular subsequences,' J. Mol. Biol., vol. 147, pp. 195-197, 1981
14	J. Y. Chen and J. V. Carlis, 'Genomic data modeling,' Information Systems, vol. 28, pp. 287, 2003
15	C. Notredame and D. G. Higgins, 'SAGA: sequence alignment by genetic algorithm,' Nucleic Acids Res., vol. 24, pp. 1515-1524, 1996
