An Efficient Suffix Tree Reconstructing Algorithm for Biological Sequence Analysis

An Efficient Suffix Tree Reconstruction Algorithm for DNA Analysis

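  • 최해원 (Department of Computer Engineering, Kyungwoon University);
  • 정영석 (Department of Computer Engineering, Kyungwoon University);
  • 김상진 (Department of Computer Engineering, Kyungwoon University)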
  • Received : 2014.09.12
  • Accepted : 2014.12.20
  • Published : 2014.12.28

Abstract

This paper introduces a new algorithm for reconstructing the suffix tree of a character string when a substring is deleted from the string or another string is inserted into it as a substring. The algorithm has two main functions, delete-structure and insert-structure. Its main objective is to save time when constructing the suffix tree of an edited string, given that the suffix tree of the original string is already available. We tested the performance of the algorithm on several DNA sequences. The tests show that delete-reconstructing saves time when the deleted subsequence is shorter than 30% of the original sequence, and that insert-reconstructing takes less time with regard to the length of the inserted sequence.

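A suffix tree is a data structure that represents all suffixes of a given string in tree form. It can be built in linear time and solves many string problems efficiently. Despite this usefulness, when a string is inserted into or deleted from a string whose suffix tree has already been built, constructing the tree again takes a considerable amount of time. This paper proposes a suffix tree reconstruction algorithm to address this problem. The proposed algorithm treats the insertion of a substring and the deletion of a substring as separate cases and is designed to account for every case that can arise in each. To evaluate its performance, we compared it experimentally with the existing Ukkonen algorithm and found that it saves more than 30% of the time when reconstructing the suffix tree.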
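To make the problem setting concrete, the following is a minimal Python sketch of the baseline that the abstract aims to improve on: editing a string and then rebuilding the suffix structure from scratch. It is not the paper's delete or insert routine, which the abstract does not describe. For brevity it uses a naive quadratic-time suffix trie instead of a linear-time suffix tree, and all function names (`build_suffix_trie`, `contains`, `delete_substring`, `insert_substring`) and the sample sequence are assumptions made for illustration.

```python
def build_suffix_trie(text: str) -> dict:
    """Naive suffix trie: insert every suffix of text + '$' one character at a time.

    Runs in O(n^2) time and space; a stand-in for a linear-time suffix tree.
    """
    text += "$"  # unique terminator so that every suffix ends at a leaf
    root: dict = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie: dict, pattern: str) -> bool:
    """Return True if pattern occurs as a substring of the indexed text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


def delete_substring(text: str, start: int, length: int) -> str:
    """Remove `length` characters beginning at index `start`."""
    return text[:start] + text[start + length:]


def insert_substring(text: str, pos: int, piece: str) -> str:
    """Insert `piece` into `text` before index `pos`."""
    return text[:pos] + piece + text[pos:]


# Hypothetical DNA fragment, chosen only for illustration.
original = "ACGTACGGA"
deleted = delete_substring(original, 2, 3)      # drops "GTA"
inserted = insert_substring(original, 4, "TT")  # adds "TT" after "ACGT"

# Baseline: every edit triggers a complete rebuild of the index.
trie_original = build_suffix_trie(original)
trie_deleted = build_suffix_trie(deleted)
trie_inserted = build_suffix_trie(inserted)
```

Each edit here pays the full construction cost again, even though most suffixes of the edited string share long stretches with suffixes of the original. That repeated cost is what a reconstruction algorithm of the kind described above tries to avoid by reusing the existing tree.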
