DOI QR코드

DOI QR Code

Accelerating next generation sequencing data analysis: an evaluation of optimized best practices for Genome Analysis Toolkit algorithms

  • Franke, Karl R. (Department of Pediatrics, Nemours Alfred I duPont Hospital for Children) ;
  • Crowgey, Erin L. (Department of Pediatrics, Nemours Alfred I duPont Hospital for Children)
  • Received : 2020.02.26
  • Accepted : 2020.03.06
  • Published : 2020.03.31

Abstract

Advancements in next generation sequencing (NGS) technologies have significantly increased the translational use of genomics data in the medical field as well as the demand for computational infrastructure capable processing that data. To enhance the current understanding of software and hardware used to compute large scale human genomic datasets (NGS), the performance and accuracy of optimized versions of GATK algorithms, including Parabricks and Sentieon, were compared to the results of the original application (GATK V4.1.0, Intel x86 CPUs). Parabricks was able to process a 50× whole-genome sequencing library in under 3 h and Sentieon finished in under 8 h, whereas GATK v4.1.0 needed nearly 24 h. These results were achieved while maintaining greater than 99% accuracy and precision compared to stock GATK. Sentieon's somatic pipeline achieved similar results greater than 99%. Additionally, the IBM POWER9 CPU performed well on bioinformatic workloads when tested with 10 different tools for alignment/mapping.

Keywords

References

  1. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010;20:1297-1303. https://doi.org/10.1101/gr.107524.110
  2. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 2011;43:491-498. https://doi.org/10.1038/ng.806
  3. Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, Del Angel G, Levy-Moonshine A, et al. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics 2013;43:11.10.1-33.
  4. Freed D, Aldana R, Weber JA, Edwards JS. The Sentieon Genomics Tools: a fast and accurate solution to variant calling from next-generation sequence data. Preprint at https://doi.org/10.1101/115717 (2017).
  5. Heldenbrand JR, Baheti S, Bockol MA, Drucker TM, Hart SN, Hudson ME, et al. Performance benchmarking of GATK3.8 and GATK4. Preprint at https://doi.org/10.1101/348565 (2018).
  6. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 2013;29:15-21. https://doi.org/10.1093/bioinformatics/bts635
  7. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 2013;14:R36. https://doi.org/10.1186/gb-2013-14-4-r36
  8. Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat Methods 2015;12:357-360. https://doi.org/10.1038/nmeth.3317
  9. Bushnell B. BBMap. SourceForge, 2019. Accessed 2019 May 21. Available from: http://www.fda.gov/downloads/Drugs/.../Guidances/UCM174090.pdf.
  10. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. preprint http://arxiv.org/abs/1303.3997 (2013).
  11. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009;10:R25. https://doi.org/10.1186/gb-2009-10-3-r25
  12. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods 2012;9:357-359. https://doi.org/10.1038/nmeth.1923
  13. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol 1990;215:403-410. https://doi.org/10.1016/S0022-2836(05)80360-2
  14. Wang M, Kong L. pblat: a multithread blat algorithm speeding up aligning sequences to genomes. BMC Bioinformatics 2019;20:28. https://doi.org/10.1186/s12859-019-2597-8
  15. Eberle MA, Fritzilas E, Krusche P, Kallberg M, Moore BL, Bekritsky MA, et al. A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Res 2017;27:157-164. https://doi.org/10.1101/gr.210500.116
  16. Cuppen E, Nijman I, Chapman B. NA12878/NA24385 tumor-like mixture. NA12878/NA24385 tumor-like mixture. Accessed 2019 May 20. Available from: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/use_cases/mixtures/UMCUTRECHT_NA12878_NA24385_mixture_10052016/.
  17. Kent WJ. BLAT: the BLAST-like alignment tool. Genome Res 2002;12:656-664. https://doi.org/10.1101/gr.229202
  18. Kendig KI, Baheti S, Bockol MA, Drucker TM, Hart SN, Heldenbrand JR, et al. Sentieon DNASeq variant calling workflow demonstrates strong computational performance and accuracy. Front Genet 2019;10:736. https://doi.org/10.3389/fgene.2019.00736