A Primer for Disease Gene Prioritization Using Next-Generation Sequencing Data

  • Wang, Shuoguo (Department of Genetics, The State University of New Jersey)
  • Xing, Jinchuan (Department of Genetics, The State University of New Jersey)
  • Received : 2013.10.18
  • Accepted : 2013.11.21
  • Published : 2013.12.31

Abstract

High-throughput next-generation sequencing (NGS) technology produces a tremendous amount of raw sequence data. The challenges for researchers are to process the raw data, map the sequences to the genome, discover variants that differ from the reference genome, and prioritize or rank those variants for the question of interest. The recent development of many computational algorithms and programs has vastly improved the ability to translate sequence data into valuable information for disease gene identification. However, NGS data analysis is complex and can be overwhelming for researchers who are not familiar with the process. Here, we outline the analysis pipeline and describe some of the most commonly used principles and tools for analyzing NGS data for disease gene identification.
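
As a rough illustration of the final prioritization step named in the abstract, the sketch below ranks genes by how many samples carry a rare, likely protein-altering variant. It is a minimal toy under stated assumptions, not the method of the paper or of any specific tool: the `Variant` record, the consequence labels in `DAMAGING`, the 1% allele-frequency cutoff, and the gene and sample names are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical consequence labels treated as likely protein-altering.
DAMAGING = {"nonsense", "frameshift", "splice_site", "missense"}


@dataclass(frozen=True)
class Variant:
    sample: str           # individual carrying the variant
    gene: str             # gene the variant falls in
    consequence: str      # predicted functional consequence
    population_af: float  # allele frequency in a reference population


def prioritize_genes(variants, max_af=0.01):
    """Rank genes by the number of distinct samples that carry
    a rare (AF <= max_af), likely damaging variant."""
    carriers = {}
    for v in variants:
        if v.population_af <= max_af and v.consequence in DAMAGING:
            carriers.setdefault(v.gene, set()).add(v.sample)
    # Sort by carrier count (descending), then by gene name for a stable order.
    return sorted(
        ((gene, len(samples)) for gene, samples in carriers.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Toy input: made-up variants for three affected individuals.
variants = [
    Variant("case1", "GENE_A", "missense", 0.001),
    Variant("case2", "GENE_A", "nonsense", 0.0),
    Variant("case1", "GENE_B", "synonymous", 0.0005),  # dropped: not damaging
    Variant("case2", "GENE_B", "missense", 0.20),      # dropped: too common
    Variant("case3", "GENE_C", "frameshift", 0.002),
]

ranking = prioritize_genes(variants)
print(ranking)  # [('GENE_A', 2), ('GENE_C', 1)]
```

Counting distinct carriers rather than raw variants keeps one sample with several variants in the same gene from inflating that gene's rank. Real pipelines typically add statistical tests and richer annotation on top of a filter like this.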

