An Analysis System for Whole Genomic Sequence Using String B-Tree

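  • 최정현 (Department of Computer Science, Graduate School, Pusan National University)
  • 조환규 (School of Electrical, Electronics, Information and Computer Engineering, Pusan National University)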
  • Published : 2001.12.01

Abstract

Genome projects have revealed the genomic sequences of many organisms. Several methods are used to analyze these sequences, including global alignment and local alignment; k-mer analysis is another. A k-mer is a run of k consecutive bases in a sequence, and k-mer analysis examines the frequency distribution and the symmetry of all k-mers in a sequence. Because a whole genomic sequence is a very large text, existing in-memory algorithms cannot handle this analysis when k is large, so efficient data structures and algorithms are needed. The string B-tree is a data structure that supports external memory and is well suited to pattern matching. In this paper, we improve the string B-tree so that it can be applied efficiently to k-mer analysis, and we report the results of k-mer analysis for C. elegans and 30 other genomic sequences. We present a visualization system based on CGR (Chaos Game Representation) that lets users investigate the distribution and symmetry of k-mer frequencies. We also define a signature as a part of a sequence that is highly similar to the whole genomic sequence, and we describe an algorithm for finding the shortest signature with high similarity.
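To make the quantities in the abstract concrete, the following is a minimal in-memory sketch of k-mer counting, a symmetry measure, and the mapping of k-mers onto a CGR grid. It is not the system described in the paper: a plain `Counter` stands in for the improved string B-tree, so it only works for sequences and values of k that fit in memory. The function names (`count_kmers`, `strand_asymmetry`, `cgr_cell`, `cgr_grid`) are hypothetical, the corner assignment A=(0,0), C=(0,1), G=(1,1), T=(1,0) is one common CGR convention assumed here, and reading "symmetry" as the agreement between a k-mer and its reverse complement is likewise an assumption.

```python
from collections import Counter

BASES = "ACGT"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
# Assumed CGR corner convention: A=(0,0), C=(0,1), G=(1,1), T=(1,0).
_CORNER = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def count_kmers(seq, k):
    """Count every length-k window of seq made only of A, C, G, T."""
    if k < 1:
        raise ValueError("k must be at least 1")
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(base in BASES for base in kmer):  # skip windows with N etc.
            counts[kmer] += 1
    return counts


def reverse_complement(kmer):
    """Return the reverse complement of a k-mer over A, C, G, T."""
    return kmer.translate(_COMPLEMENT)[::-1]


def strand_asymmetry(counts):
    """Return a value in [0, 1]; 0 means each k-mer occurs exactly as
    often as its reverse complement (assumed meaning of 'symmetry')."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    keys = set(counts) | {reverse_complement(w) for w in counts}
    diff = sum(abs(counts[w] - counts[reverse_complement(w)]) for w in keys)
    return diff / (2 * total)  # each unordered pair was visited twice


def cgr_cell(kmer):
    """Map a k-mer to its (x, y) cell in the 2^k x 2^k CGR grid.

    In the chaos game each base halves the distance to its corner, so
    the last base selects the quadrant (most significant bit) and the
    first base selects the finest subdivision (least significant bit).
    """
    x = y = 0
    for i, base in enumerate(kmer):
        cx, cy = _CORNER[base]
        x |= cx << i
        y |= cy << i
    return x, y


def cgr_grid(counts, k):
    """Build a 2^k x 2^k matrix of k-mer counts, indexed as grid[y][x]."""
    size = 1 << k
    grid = [[0] * size for _ in range(size)]
    for kmer, n in counts.items():
        x, y = cgr_cell(kmer)
        grid[y][x] = n
    return grid


if __name__ == "__main__":
    counts = count_kmers("ACGTACGT", 2)
    print(counts.most_common())
    print(strand_asymmetry(counts))
    for row in reversed(cgr_grid(counts, 2)):  # print with y increasing upward
        print(row)
```

The grid returned by `cgr_grid` can be rendered as an image in which cell intensity encodes frequency. The dictionary holds up to 4^k entries, which illustrates why an external-memory index is needed for a whole genome when k is large.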
