Parallel Approximate String Matching with k-Mismatches for Multiple Fixed-Length Patterns in DNA Sequences on Graphics Processing Units

  • Ho, ThienLuan (Dept. of Electronics and Electrical Eng., Dankook University)
  • Kim, HyunJin (Dept. of Electronics and Electrical Eng., Dankook University)
  • Oh, SeungRohk (Dept. of Electronics and Electrical Eng., Dankook University)
  • Received : 2016.08.17
  • Accepted : 2017.05.02
  • Published : 2017.06.01

Abstract

In this paper, we propose a parallel approximate string matching algorithm with k-mismatches for multiple fixed-length patterns (PMASM) in DNA sequences. PMASM extends parallel single-pattern approximate string matching algorithms to compute the Hamming distances efficiently for multiple patterns of the same fixed length. In the preprocessing phase of PMASM, all target patterns are binary encoded and stored in a look-up memory. For each input character from the input string, the Hamming distances between a substring and all patterns are updated at the same time, based on the binary encoding information in the look-up memory. Moreover, PMASM uses graphics processing units (GPUs) to perform the computations in parallel. This paper presents three PMASM implementation methods on GPUs: thread PMASM, block-thread PMASM, and shared-mem PMASM. The shared-mem PMASM method illustrates how to make effective use of the GPU's parallel capacity, and it also exploits special features of the CUDA (Compute Unified Device Architecture) memory structure to optimize performance. In experiments with DNA sequences, the proposed PMASM on GPU is 385, 77, and 64 times faster than the traditional naive algorithm, the shift-add algorithm, and the single-thread PMASM implementation on CPU, respectively. On the same NVIDIA GPU model, the proposed approach improves performance by up to 44% and 21% compared with the naive and shift-add algorithms, respectively.
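The look-up idea described in the abstract can be illustrated with a minimal sequential sketch. It is not the paper's implementation: the table layout (one bitmask per character and pattern position, with bit p set when pattern p mismatches), the function names `build_lookup` and `k_mismatch_search`, and the treatment of non-ACGT characters as mismatches are all assumptions made for illustration. Each text window is processed independently, which is the property a GPU version would exploit by assigning windows to threads.

```python
from typing import Dict, List, Tuple

ALPHABET = "ACGT"  # assumed DNA alphabet


def build_lookup(patterns: List[str]) -> Dict[str, List[int]]:
    """Preprocessing (hypothetical layout): table[c][j] is a bitmask whose
    bit p is set when patterns[p][j] != c, i.e. pattern p mismatches
    character c at position j."""
    m = len(patterns[0])
    if any(len(p) != m for p in patterns):
        raise ValueError("all patterns must have the same fixed length")
    table = {c: [0] * m for c in ALPHABET}
    for p, pat in enumerate(patterns):
        for j, pc in enumerate(pat):
            for c in ALPHABET:
                if c != pc:
                    table[c][j] |= 1 << p
    return table


def k_mismatch_search(
    text: str, patterns: List[str], k: int
) -> List[Tuple[int, int, int]]:
    """Return (start, pattern_index, hamming_distance) for every window of
    the text whose distance to a pattern is at most k."""
    m = len(patterns[0])
    table = build_lookup(patterns)
    all_mask = (1 << len(patterns)) - 1  # unknown characters mismatch everything
    hits = []
    for start in range(len(text) - m + 1):
        dist = [0] * len(patterns)
        for j in range(m):
            ch = text[start + j]
            mask = table[ch][j] if ch in table else all_mask
            # One table read updates the distance of every pattern.
            p = 0
            while mask:
                if mask & 1:
                    dist[p] += 1
                mask >>= 1
                p += 1
        hits.extend((start, p, d) for p, d in enumerate(dist) if d <= k)
    return hits
```

For example, `k_mismatch_search("ACGTACGA", ["ACGT", "ACGA"], 1)` reports both patterns at offsets 0 and 4, each with a Hamming distance of 0 or 1.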
