Gene Sequences Clustering for the Prediction of Functional Domain

기능 도메인 예측을 위한 유전자 서열 클러스터링

  • Published : 2006.10.01

Abstract

Multiple sequence alignment compares two or more DNA or protein sequences. Most multiple sequence alignment tools rely on pairwise alignment and the Smith-Waterman algorithm to generate an alignment hierarchy, so the runtime of existing methods increases exponentially as the number of sequences grows. To remedy this problem, we adopted a parallel-processing suffix tree algorithm that searches for common subsequences all at once, without pairwise alignment. Some of the common subsequences found this way are cross-matching subsequences, which trigger inexact matching, so this paper proposes a cross-matching masking process. To identify the function of the clusters generated by suffix tree clustering, BLAST and CDD (Conserved Domain Database) searches were combined with the clustering tool. The resulting clustering and annotation tool performs four steps: constructing the suffix tree, overlapping common subsequences, clustering gene sequences, and annotating gene clusters by BLAST and CDD search. The system was evaluated successfully on 36 gene sequences from the pentose phosphate pathway: it produced 10 clusters, found representative common subsequences, and identified functional domains by searching the CDD.
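
The idea of grouping sequences by shared subsequences, rather than by pairwise alignment, can be illustrated with a small sketch. This is not the authors' implementation: it replaces the suffix tree with a plain fixed-length substring index (a much simpler stand-in that answers the same "which sequences share this subsequence" question), and it leaves out the cross-matching masking step and the BLAST/CDD annotation. The function names `base_clusters` and `merge_clusters`, the parameters `k`, `min_support` and `threshold`, and the toy sequences are all assumptions made for illustration. The merge rule (join two groups when they share more than half of each other's members) follows the general style of suffix tree clustering.

```python
from collections import defaultdict
from itertools import combinations


def base_clusters(seqs, k=6, min_support=2):
    """Map each length-k substring to the set of sequences that contain it.

    Stand-in for a generalized suffix tree: only substrings shared by at
    least `min_support` sequences are kept as base clusters.
    """
    index = defaultdict(set)
    for name, seq in seqs.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return {sub: frozenset(names)
            for sub, names in index.items()
            if len(names) >= min_support}


def merge_clusters(base, threshold=0.5):
    """Merge base clusters whose member sets overlap strongly.

    Two groups are joined when the shared members exceed `threshold`
    times the size of each group (hypothetical default of 0.5).
    """
    groups = list(set(base.values()))
    parent = list(range(len(groups)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(groups)), 2):
        common = len(groups[i] & groups[j])
        if (common > threshold * len(groups[i])
                and common > threshold * len(groups[j])):
            parent[find(i)] = find(j)

    merged = defaultdict(set)
    for i, group in enumerate(groups):
        merged[find(i)] |= group
    return sorted(sorted(members) for members in merged.values())


# Toy data (made up): g1/g2 share "GGCCATTG", g3/g4 share "TACGTACG".
seqs = {
    "g1": "ATGGCCATTGTA",
    "g2": "CCGGCCATTGAC",
    "g3": "TTTACGTACGAA",
    "g4": "GGTACGTACGTT",
}
base = base_clusters(seqs)
clusters = merge_clusters(base)
print(clusters)  # [['g1', 'g2'], ['g3', 'g4']]
```

Each sequence is scanned once to build the index, so no sequence pair is ever aligned directly. A real suffix tree would additionally report shared subsequences of every length in time linear in the total input, which this fixed-`k` index does not attempt.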
