Experimental Analysis of Recent Works on the Overlap Phase of De Novo Sequence Assembly

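  • Jihyuk Lim (Department of Computer Science and Engineering, Seoul National University)
  • Sun Kim (Department of Computer Science and Engineering, Seoul National University)
  • Kunsoo Park (Department of Computer Science and Engineering, Seoul National University)
  • Received: 2017.10.24
  • Accepted: 2018.01.04
  • Published: 2018.03.15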

Abstract

Given a set of DNA read sequences, de novo sequence assembly reconstructs a target sequence without using a reference sequence. Reconstruction requires an overlap phase, which computes the overlaps between all pairs of reads. Because the overlap phase is the most time-consuming part of the assembly, the performance of the whole assembly depends on that of the overlap phase. The overlap phase has been studied extensively in various fields, and three state-of-the-art results are Readjoiner, SOF, and the Lim-Park algorithm. Recently, rapid advances in sequencing technology have made it possible to produce large read datasets at low cost, and many platforms for generating DNA read datasets have been developed. Since these platforms produce datasets with different statistical characteristics, a performance evaluation of the overlap phase should cover datasets with these differing characteristics. In this paper, we compare and analyze the performance of the three algorithms on a variety of large datasets.
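
To make the task of the overlap phase concrete, the following is a minimal brute-force sketch of computing, for every ordered pair of reads, the longest suffix of one read that matches a prefix of the other. It is only an illustration of the problem statement and is not the method used by Readjoiner, SOF, or the Lim-Park algorithm, which rely on far more efficient index structures. The function names and the `min_len` threshold are assumptions introduced here for illustration.

```python
from typing import Dict, List, Tuple


def longest_overlap(a: str, b: str, min_len: int = 1) -> int:
    """Return the length of the longest suffix of `a` that equals a prefix
    of `b`, or 0 if no such overlap of at least `min_len` characters exists."""
    max_len = min(len(a), len(b))
    # Try the longest candidate first so the first match is the longest one.
    for k in range(max_len, max(min_len, 1) - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def all_pairs_overlaps(reads: List[str], min_len: int = 1) -> Dict[Tuple[int, int], int]:
    """Naive all-pairs suffix-prefix computation over ordered pairs (i, j), i != j.

    Maps (i, j) to the longest overlap between a suffix of reads[i] and a
    prefix of reads[j]; pairs without a qualifying overlap are omitted.
    """
    overlaps: Dict[Tuple[int, int], int] = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            k = longest_overlap(a, b, min_len)
            if k > 0:
                overlaps[(i, j)] = k
    return overlaps
```

This sketch compares every pair of reads directly, so its cost grows quadratically with the number of reads, which is one way to see why the overlap phase dominates the running time on large datasets.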

Acknowledgement

Supported by: National Research Foundation of Korea
