Parallel Computation For The Edit Distance Based On The Four-Russians' Algorithm

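  • 김영호 (Department of Computer and Information Engineering, Inha University)
  • 정주희 (Department of Computer and Information Engineering, Inha University)
  • 강대웅 (Department of Computer and Information Engineering, Inha University)
  • 심정섭 (School of Computer and Information Engineering, Inha University)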
  • Received : 2012.09.26
  • Accepted : 2012.12.21
  • Published : 2013.02.28

Abstract

Approximate string matching problems have been studied in diverse fields. Recently, fast approximate string matching algorithms have been used to reduce the time and cost of next-generation sequencing. To measure the amount of error between two strings, a distance function such as the edit distance is used. Given two strings X (|X| = m) and Y (|Y| = n) over an alphabet ${\Sigma}$, the edit distance between X and Y is the minimum number of edit operations needed to convert X into Y. It can be computed with the well-known dynamic programming technique in O(mn) time and space. It can also be computed with the Four-Russians' algorithm, whose preprocessing step runs in $O((3|{\Sigma}|)^{2t}t^2)$ time and $O((3|{\Sigma}|)^{2t}t)$ space and whose computation step runs in O(mn/t) time and O(mn) space, where t is the block size. In this paper, we present a parallelized version of the computation step of the Four-Russians' algorithm. Our algorithm computes the edit distance between X and Y in O(m+n) time using m/t threads. We implemented a CPU-based sequential version and a GPU-based parallel version written in CUDA, and compared their execution times experimentally. The GPU-based algorithm runs about 10 times faster than the CPU-based algorithm when t = 1, and about 3 times faster when t = 2.
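
The O(mn) dynamic programming baseline and the idea behind the O(m+n) parallel bound can be illustrated with a minimal Python sketch. It is not the paper's CUDA implementation and does not include the Four-Russians' block lookup; it roughly corresponds to the t = 1 case, where every block is a single cell. The function names `edit_distance_dp` and `edit_distance_wavefront` are hypothetical, and unit-cost insertions, deletions, and substitutions are assumed. The wavefront version fills the table one anti-diagonal at a time: cells on the same anti-diagonal do not depend on each other, so each could be handled by its own thread, and the number of sequential steps is m + n - 1 rather than mn.

```python
from typing import Tuple


def _init_table(m: int, n: int):
    """Create an (m+1) x (n+1) table with the boundary row and column filled in."""
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete i characters of x
    for j in range(n + 1):
        d[0][j] = j  # insert j characters of y
    return d


def _cell(d, x: str, y: str, i: int, j: int) -> int:
    """Standard edit-distance recurrence for cell (i, j), assuming unit costs."""
    cost = 0 if x[i - 1] == y[j - 1] else 1
    return min(d[i - 1][j] + 1,          # deletion
               d[i][j - 1] + 1,          # insertion
               d[i - 1][j - 1] + cost)   # match or substitution


def edit_distance_dp(x: str, y: str) -> int:
    """Sequential row-by-row fill: O(mn) time and space."""
    m, n = len(x), len(y)
    d = _init_table(m, n)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = _cell(d, x, y, i, j)
    return d[m][n]


def edit_distance_wavefront(x: str, y: str) -> Tuple[int, int]:
    """Anti-diagonal (wavefront) fill; returns (distance, sequential_steps).

    Cells with i + j == k depend only on anti-diagonals k-1 and k-2, so the
    inner loop is what one thread per row could run concurrently. Here the
    threads are only simulated by an ordinary loop.
    """
    m, n = len(x), len(y)
    d = _init_table(m, n)
    steps = 0
    for k in range(2, m + n + 1):            # one sequential step per anti-diagonal
        steps += 1
        for i in range(max(1, k - n), min(m, k - 1) + 1):
            d[i][k - i] = _cell(d, x, y, i, k - i)
    return d[m][n], steps


if __name__ == "__main__":
    print(edit_distance_dp("kitten", "sitting"))         # 3
    print(edit_distance_wavefront("kitten", "sitting"))  # (3, 12)
```

For "kitten" (m = 6) and "sitting" (n = 7), both functions return distance 3, and the wavefront version needs 12 = m + n - 1 sequential steps. With blocks of size t, the same wavefront order would apply to blocks instead of single cells, which is consistent with the m/t threads stated in the abstract.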
