http://dx.doi.org/10.5808/GI.2013.11.4.191

A Primer for Disease Gene Prioritization Using Next-Generation Sequencing Data

Wang, Shuoguo (Department of Genetics, Rutgers, The State University of New Jersey)
Xing, Jinchuan (Department of Genetics, Rutgers, The State University of New Jersey)

Abstract

High-throughput next-generation sequencing (NGS) technology produces a tremendous amount of raw sequence data. The challenges for researchers are to process the raw data, map the sequences to the genome, discover variants that differ from the reference genome, and prioritize or rank the variants for the question of interest. The recent development of many computational algorithms and programs has vastly improved the ability to translate sequence data into valuable information for disease gene identification. However, NGS data analysis is complex and can be overwhelming for researchers who are not familiar with the process. Here, we outline the analysis pipeline and describe some of the most commonly used principles and tools for analyzing NGS data for disease gene identification.

Keywords

disease gene prioritization; high-throughput DNA sequencing; human genome; sequence alignment; variant discovery
