Fast Matching Method for DNA Sequences

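  • Jin Wook Kim, 김진욱 (School of Computer and Information Engineering, Inha University);
  • Eunsang Kim, 김은상 (School of Computer Science and Engineering, Seoul National University);
  • Yung-Gi Ahn, 안융기 (School of Computer Science and Engineering, Seoul National University);
  • Kunsoo Park, 박근수 (School of Computer Science and Engineering, Seoul National University)
  • Published: 2009.08.15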

Abstract

DNA sequences carry the fundamental information of each species, and comparing the DNA sequences of different species is an important task. Because DNA sequences are very long and there are many species, efficient storage matters as much as fast matching, which motivates keeping sequences in encoded form. A fast string matching method suited to encoded DNA sequences is therefore needed. In this paper, we present a fast string matching method for encoded DNA sequences that does not decode the sequences while matching. The method uses a four-characters-to-one-byte encoding and combines a suffix approach with a multi-pattern matching approach. Experimental results show that our method is about 5 times faster than AGREP and is the fastest among known algorithms.
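To make the four-characters-to-one-byte idea concrete, the following is a minimal Python sketch, not the algorithm of the paper. It assumes a hypothetical code table (A=0, C=1, G=2, T=3), packs the first base of each group into the high bits, and pads the last byte with zeros; none of these choices is stated in the abstract. The search is a simple stand-in: a match can start at any of four offsets inside a byte, so the pattern yields up to four byte-aligned cores, which hints at why a multi-pattern approach is natural here. The names `CODE`, `encode`, `base_at`, and `find_all` are illustrative only.

```python
# Hypothetical 2-bit code table; the abstract does not specify the mapping.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(seq):
    """Pack four bases per byte, first base in the high bits, zero-padded tail."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        b = 0
        for ch in chunk:
            b = (b << 2) | CODE[ch]
        b <<= 2 * (4 - len(chunk))  # pad an incomplete final group
        out.append(b)
    return bytes(out)


def base_at(packed, i):
    """Read the 2-bit code of base i directly from the packed bytes."""
    return (packed[i >> 2] >> (6 - 2 * (i & 3))) & 3


def find_all(packed, n, pattern):
    """Return all match positions in a packed text of n bases, without decoding it.

    For each of the four possible in-byte offsets of a match start, the
    byte-aligned core of the pattern is searched with bytes.find, and the
    few bases outside the core are verified with base_at.
    """
    m = len(pattern)
    codes = [CODE[ch] for ch in pattern]
    hits = []
    for offset in range(4):
        skip = (4 - offset) % 4  # pattern bases before the first byte boundary
        full = (m - skip) // 4 if m >= skip else 0  # whole bytes covered
        if full == 0:
            # Pattern too short to cover a whole byte at this offset:
            # check every start position with this offset base by base.
            candidates = range(offset, n - m + 1, 4)
            rest = range(m)
        else:
            core = encode(pattern[skip:skip + 4 * full])
            candidates = []
            k = packed.find(core)
            while k != -1:
                candidates.append(4 * k - skip)
                k = packed.find(core, k + 1)
            rest = [*range(skip), *range(skip + 4 * full, m)]
        for s in candidates:
            if 0 <= s <= n - m and all(
                base_at(packed, s + j) == codes[j] for j in rest
            ):
                hits.append(s)
    return sorted(hits)


if __name__ == "__main__":
    text = "ACGTACGTTGCAACGT"
    packed = encode(text)  # 16 bases stored in 4 bytes
    print(len(packed), find_all(packed, len(text), "CGTACGTTG"))
```

The text length `n` is passed separately because the padded final byte does not record how many bases it actually holds.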
