Approximate Periods of Strings based on Distance Sum for DNA Sequence Analysis

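  • 정주희 (Department of Computer and Information Engineering, Inha University)
  • 김영호 (Department of Computer and Information Engineering, Inha University)
  • 나중채 (Department of Computer Engineering, Sejong University)
  • 심정섭 (Department of Computer and Information Engineering, Inha University)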
  • Received : 2013.01.08
  • Accepted : 2013.01.24
  • Published : 2013.02.28

Abstract

Repetitive strings such as periods have been studied extensively in fields as diverse as data compression, computer-assisted music analysis, and bioinformatics. In bioinformatics, periods are closely related to tandem repeats, the repetitive patterns found in DNA sequences. In some cases the repeated patterns are similar but not identical, so studying tandem repeats in DNA sequences requires approximate string matching algorithms. In this paper, we propose a new definition of approximate periods of strings based on distance sum. Given two strings $p$ ($|p| = m$) and $x$ ($|x| = n$), we propose an algorithm that computes the minimum approximate period distance based on distance sum. The algorithm runs in $O(mn^2)$ time for the weighted edit distance, $O(mn)$ time for the edit distance, and $O(n)$ time for the Hamming distance.

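Repetitive strings such as periods are studied in a variety of fields, including data compression, computer-assisted music analysis, and bioinformatics. In bioinformatics, periods are closely related to tandem repeats, in which a gene sequence occurs repeatedly, and this connects them to research on approximate periods based on approximate string matching. In this paper, we define distance-sum-based approximate periods, which complement the existing definition of approximate periods, and present our results on them. Given strings $p$ and $x$ of lengths $m$ and $n$, respectively, we present an algorithm that computes the distance-sum-based minimum approximate period distance of $p$ to $x$ in $O(mn^2)$ time for the weighted edit distance, $O(mn)$ time for the edit distance, and $O(n)$ time for the Hamming distance.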
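
The abstract does not state the formal definition, so the following Python sketch rests on an assumption: the distance-sum approximate period distance of $p$ to $x$ is taken to be the minimum, over all partitions of $x$ into consecutive factors, of the summed distance between $p$ and each factor, with the last factor compared against a prefix of $p$. The function names are hypothetical, and the edit-distance routine uses a standard wrap-around dynamic program rather than the algorithm of the paper. Under this assumption the Hamming case needs one pass over $x$ and the unit-cost edit case fills an $(m+1) \times (n+1)$ table, which agrees with the $O(n)$ and $O(mn)$ bounds quoted above.

```python
def hamming_period_distance(p: str, x: str) -> int:
    """Assumed distance-sum period distance under the Hamming distance.

    Hamming distance needs equal lengths, so x is cut into blocks of
    length m = |p| and the last block is compared with a prefix of p.
    One pass over x, O(n) time.
    """
    if not p:
        raise ValueError("p must be non-empty")
    m = len(p)
    return sum(1 for k, ch in enumerate(x) if ch != p[k % m])


def edit_period_distance(p: str, x: str) -> int:
    """Assumed distance-sum period distance under unit-cost edit distance.

    Wrap-around DP (an illustration, not the paper's method):
    col[i] is the cheapest way to consume the prefix of x read so far
    with complete copies of p followed by the prefix p[:i].
    O(mn) time, O(m) space.
    """
    if not p:
        raise ValueError("p must be non-empty")
    m = len(p)
    prev = list(range(m + 1))  # empty prefix of x: delete p[:i] at cost i
    for ch in x:
        cur = [prev[0] + 1] + [0] * m  # ch inserted before the copy starts
        for i in range(1, m + 1):
            cur[i] = min(
                prev[i - 1] + (p[i - 1] != ch),  # match or substitute
                prev[i] + 1,                     # insert ch
                cur[i - 1] + 1,                  # delete p[i - 1]
            )
        if cur[m] < cur[0]:
            # A finished copy of p lets a new copy start at no cost;
            # propagate the cheaper start through the deletion edges.
            cur[0] = cur[m]
            for i in range(1, m + 1):
                cur[i] = min(cur[i], cur[i - 1] + 1)
        prev = cur
    return min(prev)  # the last factor may match any prefix of p
```

For example, with `p = "abc"` and `x = "abcbcab"`, the Hamming variant counts four mismatches because every block after the first is shifted, whereas the edit-distance variant can split `x` as `abc | bc | ab` and pay for a single deletion.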

Cited by

  1. δ-approximate Periods and γ-approximate Periods of Strings over Integer Alphabets, vol. 43, no. 10, 2016. https://doi.org/10.5626/JOK.2016.43.10.1073