http://dx.doi.org/10.3745/JIPS.04.0023

A Comprehensive Review of Emerging Computational Methods for Gene Identification  

Yu, Ning (Dept. of Computer Science, Georgia State University)
Yu, Zeng (Dept. of Computer Science, Georgia State University)
Li, Bing (Dept. of Computer Science, Georgia State University)
Gu, Feng (Dept. of Computer Science, College of Staten Island, City University of New York)
Pan, Yi (Dept. of Computer Science, Georgia State University)
Publication Information
Journal of Information Processing Systems, vol. 12, no. 1, 2016, pp. 1-34
Abstract
Gene identification is at the center of genomic studies. Although the first phase of the Encyclopedia of DNA Elements (ENCODE) project has been declared complete, the annotation of functional elements is far from finished. Computational methods for gene identification continue to play an important role in this area and in related problems. A great deal of work has been done, and a plethora of computational methods and approaches have been developed. Many review papers have summarized these methods and related work, but most of them focus on the methodologies from a particular aspect or perspective. In contrast, this paper aims to comprehensively summarize the mainstream computational methods in gene identification and to provide a concise technical reference for future studies. This review also sheds light on the emerging trends and cutting-edge techniques that are expected to lead research in this field.
Keywords
Cloud Computing; Comparative Methods; Deep Learning; Fourier Transform; Gene Identification; Gene Prediction; Hidden Markov Model; Machine Learning; Protein-Coding Region; Support Vector Machine
  • Reference
1 R. Ranawana and V. Palade, "A neural network based multi-classifier system for gene identification in DNA sequences," Neural Computing & Applications, vol. 14, no. 2, pp. 122-131, 2005.   DOI
2 Y. Xu, J. R. Einstein, R. Mural, M. Shah, and E. C. Uberbacher, "An improved system for exon recognition and gene modeling in human DNA sequences," in Proceedings of the 2nd International Conference on Intelligent Systems for Molecular Biology, San Francisco, CA, 1994, pp. 376-384.
3 L. Roberts, N. Steele, C. Reeves, and G. King, "Training neural networks to identify coding regions in genomic DNA," in Proceedings of the 4th International Conference on Artificial Neural Networks, Cambridge, UK, 1995, pp. 399-403.
4 E. E. Snyder and G. D. Stormo, "Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks." Nucleic Acids Research, vol. 21, no. 3, p. 607-613, 1993.   DOI
5 Y. Xu, R. Mural, J. Einstein, M. Shah, and E. Uberbacher, "GRAIL: a multi-agent neural network system for gene identification," Proceedings of the IEEE, vol. 84, no. 10, pp. 1544-1552, 1996.
6 J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the Theory of Neural Computation. Redwood City, CA: Addison-Wesley, 1991.
7 C. Li, P. He, and J. Wang, "Artificial neural network method for predicting protein-coding genes in the yeast genome," Internet Electronic Journal of Molecular Design, vol. 2, pp. 527-538, 2003.
8 M. K. K. Leung, H. Y. Xiong, L. J. Lee, and B. J. Frey, "Deep learning of the tissue-regulated splicing code," Bioinformatics, vol. 30, no. 12, pp. i121-i129, 2014.   DOI
9 Y. Bengio, A. Courville, and P. Vincent, "Representation learning: a review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798-1828, 2013.   DOI
10 P. Ramachandran and A. Antoniou, "Identification of hot-spot locations in proteins using digital filters," IEEE Journal of Selected Topics in Signal Processing, vol. 2, no. 3, pp. 378-389, 2008.   DOI
11 M. Sardaraz, M. Tahir, A. A. Ikram, and H. Bajwa, "SeqCompress: an algorithm for biological sequence compression," Genomics, vol. 104, no. 4, pp. 225-228, 2014.   DOI
12 L. Krause, A. C. McHardy, T. W. Nattkemper, A. Phler, J. Stoye, and F. Meyer, "GISMO: gene identification using a support vector machine for ORF classification," Nucleic Acids Research, vol. 35, no. 2, pp. 540-549, 2007.   DOI
13 K. Vervier, P. Mathé, M. Tournoud, J. B. Veyrieras, and J. P. Vert, "Large-scale machine learning for metagenomics sequence classification," Bioinformatics, 2015, http://dx.doi.org/10.1093/bioinformatics/btv683.
14 M. Welling, "Are machine learning and statistics complementary?" Dec. 2015; https://www.ics.uci.edu/-welling/publications/papers/WhyMLneedsStatistics.pdf.
15 P. Di Lena, K. Nagata, and P. Baldi, "Deep architectures for protein contact map prediction," Bioinformatics, vol. 28, no. 19, pp. 2449-2457, 2012.   DOI
16 G. Hinton, P. Dayan, B. Frey, and R. Neal, "The 'wake-sleep' algorithm for unsupervised neural networks," Science, vol. 268, no. 5214, pp. 1158-1161, 1995.   DOI
17 G. E. Hintonemail, "Learning multiple layers of representation," Trends in Cognitive Sciences, vol. 11, no. 10, pp. 428-434, 2007.   DOI
18 L. Deng, G. Hinton, and B. Kingsbury, "New types of deep neural network learning for speech recognition and related applications: an overview," in Proceedings of 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, BC, 2013, pp. 8599-8603.
19 J. Eickholt and J. Cheng, "Predicting protein residue-residue contacts using deep networks and boosting," Bioinformatics, vol. 28, no. 23, pp. 3066-3072, 2012.   DOI
20 A. Ben-Hur, C. S. Ong, S. Sonnenburg, B. Scholkopf, and G. Ratsch, "Support vector machines and kernels for computational biology," PLoS Computational Biology, vol. 4, no. 10, article ID. e1000173, 2008.
21 A. Zien, G. Rätsch, S. Mika, B. Schölkopf, T. Lengauer, and K. R. Muller, "Engineering support vector machine kernels that recognize translation initiation sites," Bioinformatics, vol. 16, no. 9, pp. 799-807, 2000.   DOI
22 S. Sonnenburg, A. Zien, and G. Ratsch, "ARTS: accurate recognition of transcription starts in human," Bioinformatics, vol. 22, no. 14, pp. e472-e480, 2006.   DOI
23 S. Sonnenburg, G. Schweikert, P. Philips, J. Behr, and G. Ratsch, "Accurate splice site prediction using support vector machines," BMC Bioinformatics, vol. 8, no. Suppl 10, article ID. S7, 2007.
24 C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.   DOI
25 H. Liu, H. Han, J. Li, and L. Wong, "An in-silico method for prediction of polyadenylation signals in human sequences," Genome Informatics, vol. 14, pp. 84-93, 2003.
26 B. Scholkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA: MIT Press, 2002.
27 G. Ratsch and S. Sonnenburg, "Large scale hidden semi-Markov SVMs," in Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2007, pp. 1161-1168.
28 C. Yu, M. Deng, L. Zheng, R. L. He, J. Yang, and S. S. T. Yau, "DFA7, a new method to distinguish between intron-containing and intronless genes," PLoS ONE, vol. 9, no. 7, article ID. e101363, 2014.
29 Y. Liu, J. Guo, G. Hu, and H. Zhu, "Gene prediction in metagenomic fragments based on the SVM algorithm," BMC Bioinformatics, vol. 14, no. Suppl 5, article ID. S12, 2013.
30 C. Leslie, E. Eskin, and W. S. Noble, "The spectrum kernel: a string kernel for SVM protein classification," Pacific Symposium on Biocomputing, vol. 7, pp. 564-575, 2002.
31 G. Ratsch, S. Sonnenburg, and B. Scholkopf, "RASE: recognition of alternatively spliced exons in C. elegans," Bioinformatics, vol. 21, no. Suppl 1, pp. i369-i377, 2005.   DOI
32 S. Sonnenburg, G. Rätsch, C. Schafer, and B. Scholkopf, "Large scale multiple kernel learning," Journal of Machine Learning Research, vol. 7, pp. 1531-1565, 2006.
33 C. S. Leslie, E. Eskin, A. Cohen, J. Weston, and W. S. Noble, "Mismatch string kernels for discriminative protein classification," Bioinformatics, vol. 20, no. 4, pp. 467-476, 2004.   DOI
34 L. Liao and W. S. Noble, "Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships," Journal of Computational Biology, vol. 10, no. 6, pp. 857-868, 2003.   DOI
35 P. Meinicke, M. Tech, B. Morgenstern, and R. Merkl, "Oligo kernels for data mining on biological sequences: a case study on prokaryotic translation initiation sites," BMC Bioinformatics, vol. 5, article ID. 169, 2004.
36 D. Haussler, "Convolution kernels on discrete structures," University of California at Santa Cruz, CA, Technical Report UCS-CRL-99-10, 1999.
37 L. Sun, H. Luo, D. Bu, G. Zhao, K. Yu, C. Zhang, Y. Liu, R. Chen, and Y. Zhao, "Utilizing sequence intrinsic composition to classify protein-coding and long non-coding transcripts," Nucleic Acids Research, vol. 41, no. 17, article ID. e166, 2013.
38 H. Saigo, J. P. Vert, N. Ueda, and T. Akutsu, "Protein homology detection using string alignment kernels," Bioinformatics, vol. 20, no. 11, pp. 1682-1689, 2004.   DOI
39 J. Vert, H. Saigo, and T. Akutsu, "Local alignment kernels for biological sequences," in Kernel Methods in Computational Biology, B. Scholkopf, K. Tsuda, and J. P. Vert, Eds. Cambridge, MA: MIT Press, 2004, pp. 131- 154.
40 K. Tsuda, M. Kawanabe, G. Rtsch, S. Sonnenburg, and K. R. Muller, "A new discriminative kernel from probabilistic models," Neural Computation, vol. 14, no. 10, pp. 2397-2414, 2002.   DOI
41 M. Seeger, "Covariance kernels from Bayesian generative models," in Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2002, pp. 905-912.
42 K. Tsuda, T. Kin, and K. Asai, "Marginalized kernels for biological sequences," Bioinformatics, vol. 18, no. Suppl 1, pp. S268-S275, 2002.   DOI
43 S. Schwartz, W. J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. C. Hardison, D. Haussler, and W. Miller, "Humanmouse alignments with BLASTZ," Genome Research, vol. 13, no. 1, pp. 103-107, 2003.   DOI
44 G. Schweikert, A. Zien, G. Zeller, J. Behr, C. Dieterich, C. S. Ong, et al., "mGENE: accurate svm-based gene finding with an application to nematode genomes," Genome Research, vol. 19, no. 11, pp. 2133-2143, 2009.   DOI
45 U. Kamath, K. De Jong, and A. Shehu, "Effective automated feature construction and selection for classification of biological sequences," PLoS ONE, vol. 9, no. 7, article ID. e99982, 2014.
46 R. Zhang and C. T. Zhang, "Z curves, an intuitive tool for visualizing and analyzing the DNA sequences," Journal of Biomolecular Structure and Dynamics, vol. 11, no. 4, pp. 767-782, 1994.   DOI
47 S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs," Nucleic Acids Research, vol. 25, no. 17, pp. 3389-3402, 1997.   DOI
48 B. Ma, J. Tromp, and M. Li, "PatternHunter: faster and more sensitive homology search," Bioinformatics, vol. 18, no. 3, pp. 440-445, 2002.   DOI
49 M. Chaisson and G. Tesler, "Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory," BMC Bioinformatics, vol. 13, no. 1, article ID. 238, 2012.
50 T. Wiehe, S. Gebauer-Jung, T. Mitchell-Olds, and R. Guigo, "SGP-1: prediction and validation of homologous genes based on sequence alignments," Genome Research, vol. 11, no. 9, pp. 1574-1583, 2001.   DOI
51 R. A. Cartwright, "Ngila: global pairwise alignments with logarithmic and affine gap costs," Bioinformatics, vol. 23, no. 11, pp. 1427-1428, 2007.   DOI
52 R. Guigo, E. T. Dermitzakis, P. Agarwal, C. P. Ponting, G. Parra, A. Reymond, et al., "Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes," Proceedings of the National Academy of Sciences, vol. 100, no. 3, pp. 1140-1145, 2003.
53 S. Batzoglou, L. Pachter, J. P. Mesirov, B. Berger, and E. S. Lander, "Human and mouse gene structure: comparative analysis and application to exon prediction," Genome Research, vol. 10, no. 7, pp. 950-958, 2000.   DOI
54 S. Kurtz, A. Phillippy, A. Delcher, M. Smoot, M. Shumway, C. Antonescu, and S. Salzberg, "Versatile and open software for comparing large genomes," Genome Biology, vol. 5, no. 2, article ID. R12, 2004.
55 V. Bafna and D. H. Huson, "The conserved exon method for gene finding," in Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, La Jolla, CA, 2000, pp. 3-12.
56 P. S. Novichkov, M. S. Gelfand, and A. A. Mironov, "Gene recognition in eukaryotic DNA by comparison of genomic sequences," Bioinformatics, vol. 17, no. 11, pp. 1011-1018, 2001.   DOI
57 P. Blayo, P. Rouzé, and M. F. Sagot, "Orphan gene finding: an exon assembly approach," Theoretical Computer Science, vol. 290, no. 3, pp. 1407-1431, 2003.   DOI
58 S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman, "Basic local alignment search tool," Journal of Molecular Biology, vol. 215, no. 3, pp. 403-410, 1990.   DOI
59 X. Huang, M. D. Adams, H. Zhou, and A. R. Kerlavage, "A tool for analyzing and annotating genomic sequences," Genomics, vol. 46, no. 1, pp. 37-45, 1997.   DOI
60 L. Florea, G. Hartzell, Z. Zhang, G. M. Rubin, and W. Miller, "A computer program for aligning a cDNA sequence with a genomic DNA sequence," Genome Research, vol. 8, no. 9, pp. 967-974, 1998.
61 S. J. Wheelan, D. M. Church, and J. M. Ostell, "Spidey: a tool for mRNA-to-genomic alignments," Genome Research, vol. 11, no. 11, pp. 1952-1957, 2001.   DOI
62 Y. Fukunishi, H. Suzuki, M. Yoshino, H. Konno, and Y. Hayashizaki, "Prediction of human cDNA from its homologous mouse full-length cDNA and human shotgun database," FEBS Letters, vol. 464, no. 3, pp. 129- 132, 1999.   DOI
63 J. Jiang and H. J. Jacob, "EbEST: an automated tool using expressed sequence tags to delineate gene structure," Genome Research, vol. 8, no. 3, pp. 268-275, 1998.
64 R. Mott, "EST-GENOME: a program to align spliced DNA sequences to unspliced genomic DNA," Computer Applications in the Biosciences (CABIOS), vol. 13, no. 4, pp. 477-478, 1997.
65 Z. Kan, E. C. Rouchka, W. R. Gish, and D. J. States, "Gene structure prediction and alternative splicing analysis using genomically aligned ESTs," Genome Research, vol. 11, no. 5, pp. 889-900, 2001.   DOI
66 X. J. Min, G. Butler, R. Storms, and A. Tsang, "OrfPredictor: predicting protein-coding regions in EST-derived sequences," Nucleic Acids Research, vol. 33, no. Suppl 2, pp. W677-W680, 2005.   DOI
67 M. L. Metzker, "Sequencing technologies the next generation," Nature Reviews Genetics, vol. 11, no. 1, pp. 31- 46, 2010.   DOI
68 O. Keller, M. Kollmar, M. Stanke, and S. Waack, "A novel hybrid gene prediction method employing protein multiple sequence alignments," Bioinformatics, vol. 27, no. 6, pp. 757-763, 2011.   DOI
69 L. Wang, H. J. Park, S. Dasari, S. Wang, J. P. Kocher, and W. Li, "CPAT: coding-potential assessment tool using an alignment-free logistic regression model," Nucleic Acids Research, vol. 41, no. 6, article ID. e74, 2013.
70 S. Washietl, S. Findeiss, S. A. Müller, S. Kalkhof, M. von Bergen, I. L. Hofacker, P. F. Stadler, and N. Goldman, "RNAcode: robust discrimination of coding and noncoding regions in comparative sequence data," RNA, vol. 17, no. 4, p. 578-594, 2011.   DOI
71 W. Trimble, K. Keegan, M. D'Souza, A. Wilke, J. Wilkening, J. Gilbert, and F. Meyer, "Short-read readingframe predictors are not created equal: sequence error causes loss of signal," BMC Bioinformatics, vol. 13, no. 1, article ID. 183, 2012.
72 N. E. Castellana, S. H. Payne, Z. Shen, M. Stanke, V. Bafna, and S. P. Briggs, "Discovery and revision of arabidopsis genes by proteogenomics," Proceedings of the National Academy of Sciences, vol. 105, no. 52, pp. 21034-21038, 2008.
73 J. Usuka and V. Brendel, "Gene structure prediction by spliced alignment of genomic DNA with protein sequences: increased accuracy by differential splice site scoring," Journal of Molecular Biology, vol. 297, no. 5, pp. 1075-1085, 2000.   DOI
74 E. Birney, M. Clamp, and R. Durbin, "GeneWise and genomewise," Genome Research, vol. 14, no. 5, p. 988- 995, 2004.   DOI
75 I. B. Rogozin, L. Milanesi, and N. A. Kolchanov, "Gene structure prediction using information on homologous protein sequence," Computer Applications in the Biosciences (CABIOS), vol. 12, no. 3, pp. 161-170, 1996.
76 O. Gotoh, "Homology-based gene structure prediction: simplified matching algorithm using a translated codon (tron) and improved accuracy by allowing for long gaps," Bioinformatics, vol. 16, no. 3, pp. 190-202, 2000.   DOI
77 J. Wu, "Improving the specificity of exon prediction using comparative genomics," BMC Genomics, vol. 9, no. Suppl 2, article ID. S13, 2008.
78 S. Hunter, R. Apweiler, T. K. Attwood, A. Bairoch, A. Bateman, D. Binns, et al., "Interpro: the integrative protein signature database," Nucleic Acids Research, vol. 37, no. Suppl 1, pp. D211-D215, 2009.   DOI
79 M. O. Dayhoff and R. M. Schwartz, "A model of evolutionary change in proteins," Atlas of Protein Sequence and Structure, vol. 5, pp. 345-252, 1978.
80 S. Henikoff and J. G. Henikoff, "Amino acid substitution matrices from protein blocks," Proceedings of the National Academy of Sciences, vol. 89, no. 22, pp. 10915-10919, 1992.
81 M. S. Gelfand, A. A. Mironov, and P. A. Pevzner, "Gene recognition via spliced sequence alignment." Proceedings of the National Academy of Sciences, vol. 93, no. 17, pp. 9061-9066, 1996.
82 M. Stanke, A. Tzvetkova, and B. Morgenstern, "AUGUSTUS at EGASP: using EST, protein and genomic alignments for improved gene prediction in the human genome," Genome Biology, vol. 7, no. Suppl 1, article ID. S11, 2006.
83 Y. Xu and E. C. Uberbacher, "Gene prediction by pattern recognition and homology search," in Proceeding of the 4th International Conference on Intelligent Systems for Molecular Biology, St. Louis, MO, 1996, pp. 241-251.
84 Y. Cai and P. Bork, "Homology-based gene prediction using neural nets," Analytical Biochemistry, vol. 265, no. 2, pp. 269-274, 1998.   DOI
85 D. Rose, M. Hiller, K. Schutt, J. Hackermller, R. Backofen, and P. F. Stadler, "Computational discovery of human coding and non-coding transcripts with conserved splice sites," Bioinformatics, vol. 27, no. 14, pp. 1894-1900, 2011.   DOI
86 R. Guigo, P. Flicek, J. Abril, A. Reymond, J. Lagarde, F. Denoeud, et al., "EGASP: the human ENCODE genome annotation assessment project," Genome Biology, vol. 7, no. Suppl 1, article ID. S2, 2006.
87 W. Klimke, C. O'Donovan, O. White, J. R. Brister, K. Clark, B. Fedoro, and T. Tatusova, "Solving the problem: genome annotation standards before the data deluge," Standards in Genomic Sciences, vol. 5, no. 1, pp. 168-193, 2011.   DOI
88 ENCODE Project Consortium, "An integrated encyclopedia of DNA elements in the human genome," Nature, vol. 489, no. 7414, pp. 57-74, 2012.   DOI
89 S. Djebali, C. A. Davis, A. Merkel, A. Dobin, T. Lassmann, A. Mortazavi, et al., "Landscape of transcription in human cells," Nature, vol. 489, no. 7414, pp. 101-108, 2012.   DOI
90 J. E. Allen and S. L. Salzberg, "JIGSAW: integration of multiple sources of evidence for gene prediction," Bioinformatics, vol. 21, no. 18, pp. 3596-3603, 2005.   DOI
91 L. Pachter, M. Alexandersson, and S. Cawley, "Applications of generalized pair hidden Markov models to alignment and gene finding problems," Journal of Computational Biology, vol. 9, no. 2, pp. 389-399, 2002.   DOI
92 T. Larsen and A. Krogh, "EasyGene: a prokaryotic gene finder that ranks ORFs by statistical significance," BMC Bioinformatics, vol. 4, article ID. 21, 2003.
93 G. Parra, P. Agarwal, J. F. Abril, T. Wiehe, J. W. Fickett, and R. Guigo, "Comparative gene prediction in human and mouse," Genome Research, vol. 13, no. 1, pp. 108-117, 2003.   DOI
94 R. A. Tesorero, N. Yu, J. O. Wright, J. P. Svencionis, Q. Cheng, J. H. Kim, and K. H. Cho, "Novel regulatory small RNAs in streptococcus pyogenes," PLoS ONE, vol. 8, no. 6, article ID. e64021, 2013.
95 Y. Zhou, Y. Liang, C. Hu, L. Wang, and X. Shi, "An artificial neural network method for combining gene prediction based on equitable weights," Neurocomputing, vol. 71, no. 4-6, pp. 538-543, 2008.   DOI
96 A. Krogh, "Two methods for improving performance of an hmm and their application for gene finding," in Proceeding of the 5th International Conference on Intelligent Systems for Molecular Biology, Chalkidikee, Greece, 1997, pp. 179-186.
97 A. L. Delcher, D. Harmon, S. Kasif, O. White, and S. L. Salzberg, "Improved microbial gene identification with glimmer," Nucleic Acids Research, vol. 27, no. 23, pp. 4636-4641, 1999.   DOI
98 M. E. Dinger, K. C. Pang, T. R. Mercer, and J. S. Mattick, "Differentiating protein-coding and noncoding RNA: challenges and ambiguities," PLoS Computational Biology, vol. 4, no. 11, article ID. e1000176, 2008.
99 J. Harrow, A. Nagy, A. Reymond, T. Alioto, L. Patthy, S. Antonarakis, and R. Guigo, "Identifying protein-coding genes in genomic sequences," Genome Biology, vol. 10, no. 1, article ID. 201, 2009.
100 M. Hiller, B. T. Schaar, and G. Bejerano, "Hundreds of conserved noncoding genomic regions are independently lost in mammals," Nucleic Acids Research, vol. 40, no. 22, pp. 11463-11476, 2012.   DOI
101 J. W. Fickett, "Finding genes by computer: the state of the art," Trends in Genetics, vol. 12, no. 8, pp. 316-320, 1996.   DOI
102 C. Mathe, M. F. Sagot, T. Schiex, and P. Rouze, "Current methods of gene prediction, their strengths and weaknesses," Nucleic Acids Research, vol. 30, no. 19, pp. 4103-4117, 2002.   DOI
103 R. She, "Fast and accurate gene prediction by protein homology," Ph.D. dissertation, Simon Fraser University, Burnaby, British Columbia, Canada, 2010.
104 N. Goel, S. Singh, and T. C. Aseri, "A review of soft computing techniques for gene prediction," ISRN Genomics, vol. 2013, article ID. 191206, 2013.
105 C. Yang, E. Bolotin, T. Jiang, F. M. Sladek, and E. Martinez, "Prevalence of the initiator over the TATA box in human and yeast genes and identification of DNA motifs enriched in human TATA-less core promoters," Gene, vol. 389, no. 1, pp. 52-65, 2007.   DOI
106 P. Bucher, "Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences," Journal of Molecular Biology, vol. 212, no. 4, pp. 563-578, 1990.   DOI
107 A. Coghlan, T. J. Fiedler, S. J. McKay, P. Flicek, T. W. Harris, D. Blasiar, and L. D. Stein, "nGASP: the nematode genome annotation assessment project," BMC Bioinformatics, vol. 9, article ID. 549, 2008.
108 M. Burset and R. Guigo, "Evaluation of gene structure prediction programs," Genomics, vol. 34, no. 3, pp. 353- 367, 1996.   DOI
109 J. Nasiri, M. Naghavi, S. N. Rad, T. Yolmeh, M. Shirazi, R. Naderi, M. Nasiri, and S. Ahmadi, "Gene identification programs in bread wheat: a comparison study," Nucleosides, Nucleotides and Nucleic Acids, vol. 32, no. 10, pp. 529-554, 2013.   DOI
110 W. Kent, C. Sugnet, T. Furey, K. Roskin, T. Pringle, A. Zahler, and D. Haussler, "UCSC genome browser," Genome Research, vol. 12, no. 6, pp. 996-1006, 2002.   DOI
111 C. elegans Sequencing Consortium, "Genome sequence of the nematode C. elegans: a platform for investigating biology," Science, vol. 282, no. 5396, pp. 2012-2018, 1998.   DOI
112 N. Chen, T. W. Harris, I. Antoshechkin, C. Bastiani, T. Bieri, D. Blasiar, et al., "WormBase: a comprehensive data resource for Caenorhabditis biology and genomics," Nucleic Acids Research, vol. 33, no. Suppl 1, pp. D383- D389, 2005.
113 A. Rogers, I. Antoshechkin, T. Bieri, D. Blasiar, C. Bastiani, P. Canaran, et al., "WormBase 2007," Nucleic Acids Research, vol. 36, no. Suppl 1, pp. D612-D617, 2008.
114 T. Steijger, J. F. Abril, P. G. Engström, F. Kokocinski, Consortium, T. J. Hubbard, R. Guigo, J. Harrow, and P. Bertone, "Assessment of transcript reconstruction methods for RNA-seq," Nature Methods, vol. 10, no. 12, pp. 1177-1184, 2013.   DOI
115 M. Vilardell, G. Parra, and S. Civit, "WISCOD: a statistical web-enabled tool for the identification of significant protein coding regions," BioMed Research International, vol. 2014, article ID. 282343, 2014.
116 J. W. Fickett, "Recognition of protein coding regions in DNA sequences," Nucleic Acids Research, vol. 10, no. 17, pp. 5303-5318, 1982.   DOI
117 M. Q. Zhang, "Computational prediction of eukaryotic protein-coding genes," Nature Reviews Genetics, vol. 3, no. 9, pp. 698-709, 2002.   DOI
118 C. Trapnell, L. Pachter, and S. L. Salzberg, "TopHat: discovering splice junctions with RNA-seq," Bioinformatics, vol. 25, no. 9, pp. 1105-1111, 2009.   DOI
119 M. Akhtar, J. Epps, and E. Ambikairajah, "Signal processing in sequence analysis: advances in eukaryotic gene prediction," IEEE Journal of Selected Topics in Signal Processing, vol. 2, no. 3, pp. 310-321, 2008.   DOI
120 D. Kotlar and Y. Lavner, "Gene prediction by spectral rotation measure: a new method for identifying proteincoding regions," Genome Research, vol. 13, no. 8, pp. 1930-1937, 2003.   DOI
121 N. Yu, X. Guo, F. Gu, and Y. Pan, "DNA AS X: an information-coding based model to improve the sensitivity in comparative gene analysis," in Proceedings of the 11th International Symposium on Bioinformatics Research and Applications, Norfolk, VA, 2015, pp. 366-377.
122 R. F. Voss, "Evolution of long-range fractal correlations and 1/f noise in DNA base sequences," Physical Review Letters, vol. 68, no. 25, pp. 3805-3808, 1992.   DOI
123 I. Cosic, "Macromolecular bioactivity: is it resonant interaction between macromolecules? Theory and applications," IEEE Transactions on Biomedical Engineering, vol. 41, no. 12, pp. 1101-1114, 1994.   DOI
124 H. K. Kwan and S. Arniker, "Numerical representation of DNA sequences," in Proceedings of IEEE International Conference on Electro/Information Technology (eit'09), Windsor, ON, 2009, pp. 307-310.
125 S. Spicuglia, M. A. Maqbool, D. Puthier, and J. C. Andrau, "An update on recent methods applied for deciphering the diversity of the noncoding RNA genome structure and function," Methods, vol. 63, no. 1, pp. 3-17, 2013.   DOI
126 G. St Laurent, D. Shtokalo, M. Tackett, Z. Yang, T. Eremina, C. Wahlestedt, et al., "Intronic RNAs constitute the major fraction of the noncoding RNA in mammalian cells," BMC Genomics, vol. 13, no. 1, article ID. 504, 2012.
127 Y. Bai, J. Hassler, A. Ziyar, P. Li, Z. Wright, R. Menon, et al., "Novel bioinformatics method for identification of genome-wide non-canonical spliced regions using RNA-Seq data," PLoS ONE, vol. 9, no. 7, article ID. e100864, 2014.
128 H. Wang, P. J. Chung, J. Liu, I. C. Jang, M. Kean, J. Xu, and N. H. Chua, "Genome-wide identification of long noncoding natural antisense transcripts and their responses to light in Arabidopsis," Genome Research, vol. 24, no. 3, pp. 444-453, 2014.   DOI
129 J. W. Nam and D. P. Bartel, "Long noncoding RNAs in C. elegans," Genome Research, vol. 22, no. 12, pp. 2529- 2540, 2012.   DOI
130 R. Weikard, F. Hadlich, and C. Kuehn, "Identification of novel transcripts and noncoding RNAs in bovine skin by deep next generation sequencing," BMC Genomics, vol. 14, no. 1, article ID. 789, 2013.
131 N. L. Barbosa-Morais, M. Irimia, Q. Pan, H. Y. Xiong, S. Gueroussov, L. J. Lee, et al., "The evolutionary landscape of alternative splicing in vertebrate species," Science, vol. 338, no. 6114, pp. 1587-1593, 2012.   DOI
132 Q. Pan, O. Shai, L. J. Lee, B. J. Frey, and B. J. Blencowe, "Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing," Nature Genetics, vol. 40, no. 12, pp. 1413-1415, 2008.   DOI
133 N. Rao and S. Shepherd, "Detection of 3-periodicity for small genomic sequences based on AR technique," in Proceedings of 2004 International Conference on Communications, Circuits and Systems (ICCCAS2004), Cheongdu, China, 2004, pp. 1032-1036.
134 B. D. Silverman and R. Linsker, "A measure of DNA periodicity," Journal of Theoretical Biology, vol. 118, no. 3, pp. 295-300, 1986.   DOI
135 S. Tiwari, S. Ramachandran, A. Bhattacharya, S. Bhattacharya, and R. Ramaswamy, "Prediction of probable genes by fourier analysis of genomic sequences," Computer Applications in the Biosciences (CABIOS), vol. 13, no. 3, pp. 263-270, 1997.
136 D. Anastassiou, "Frequency-domain analysis of biomolecular sequences," Bioinformatics, vol. 16, no. 12, pp. 1073-1081, 2000.   DOI
137 G. Liu and Y. Luan, "Identification of protein coding regions in the eukaryotic DNA sequences based on marple algorithm and wavelet packets transform," Abstract and Applied Analysis, vol. 2014, article ID. 402567, 2014.
138 G. Zhang and G. Zhou, "The Marple algorithm for the autoregressive spectral estimates of the SMMW Fourier transform spectroscopy data," International Journal of Infrared and Millimeter Waves, vol. 10, no. 2, pp. 257-267, 1989.   DOI
139 I. Barrodale, L. M. Delves, R. E. Erickson, and C. A. Zala, "Computational experience with Marple's algorithm for autoregressive spectrum analysis," Geophysics, vol. 48, no. 9, pp. 1274-1286, 1983.   DOI
140 O. Abbasi, A. Rostami, and G. Karimian, "Identification of exonic regions in DNA sequences using crosscorrelation and noise suppression by discrete wavelet transform," BMC Bioinformatics, vol. 12, article ID. 430, 2011.
141 M. Hiller, S. Agarwal, J. H. Notwell, R. Parikh, H. Guturu, A. M. Wenger, and G. Bejerano, "Computational methods to detect conserved non-genic elements in phylogenetically isolated genomes: application to zebrafish," Nucleic Acids Research, vol. 41, no. 15, article ID. e151, 2013.
142 H. Ohmiya, M. Vitezic, M. Frith, M. Itoh, P. Carninci, A. Forrest, et al., "Reclu: a pipeline to discover reproducible transcriptional start sites and their alternative regulation using capped analysis of gene expression (cage)," BMC Genomics, vol. 15, no. 1, article ID. 269, 2014.
143 Y. Li, H. Li-Byarlay, P. Burns, M. Borodovsky, G. E. Robinson, and J. Ma, "TrueSight: a new algorithm for splice junction detection using RNA-seq," Nucleic Acids Research, vol. 41, no. 4, article ID. e51, 2013.
144 P. D. Burns, Y. Li, J. Ma, and M. Borodovsky, "UnSplicer: mapping spliced RNA-seq reads in compact genomes and filtering noisy splicing," Nucleic Acids Research, vol. 42, no. 4, article ID. e25, 2014.
145 S. Lertampaiporn, C. Thammarongtham, C. Nukoolkit, B. Kaewkamnerdpong, and M. Ruengjitchatchawalya, "Identification of non-coding RNAs with a new composite feature in the hybrid random forest ensemble algorithm," Nucleic Acids Research, vol. 42, no. 11, article ID. e93, 2014.
146 C. De Filippo, M. Ramazzotti, P. Fontana, and D. Cavalieri, "Bioinformatic approaches for functional annotation and pathway inference in metagenomics data," Briefings in Bioinformatics, vol. 13, no. 6, pp. 696- 710, 2012.   DOI
147 H. Soueidan and M. Nikolski, "Machine learning for metagenomics: methods and tools," Oct. 2015; http://arxiv.org/pdf/1510.06621v1.pdf.
148 E. Wijaya, M. C. Frith, P. Horton, and K. Asai, "Finding protein-coding genes through human polymorphisms," PLoS ONE, vol. 8, no. 1, article ID. e54210, 2013.
149 H. K. Kwan, R. Atwal, and B. Y. M. Kwan, "Wavelet analysis of DNA sequences," in Proceedings of International Conference on Communications, Circuits and Systems (ICCCAS2008), Fujian, China, 2008, pp. 816-820.
150 S. Deng, L. Yuan, K. Feng, G. Ding, and Y. Li, "A new approach for identifying protein-coding regions by combining chirp z and wavelet transform," Current Bioinformatics, vol. 8, no. 5, pp. 557-563, 2013.   DOI
151 E. Ambikairajah, J. Epps, and M. Akhtar, "Gene and exon prediction using time domain algorithms," in Proceedings of the 8th International Symposium on Signal Processing and Its Applications (ISSPA2005), Sydney, Australia, 2005, pp. 199-202.
152 M. Akhtar, J. Epps, and E. Ambikairajah, "Time and frequency domain methods for gene and exon prediction in eukaryotes," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP2007), Honolulu, HI, 2007, pp. 573-576.
153 M. Roy and S. Barman, "Effective gene prediction by high resolution frequency estimator based on least-norm solution technique," EURASIP Journal on Bioinformatics and Systems Biology, vol. 2014, no. 1, pp. 1-13, 2014.   DOI
154 S. S. Sahu and G. Panda, "Identification of protein-coding regions in DNA sequences using a time-frequency filtering approach," Genomics, Proteomics & Bioinformatics, vol. 9, no. 1-2, pp. 45-55, 2011.   DOI
155 S. Deng, Y. Shi, L. Yuan, Y. Li, and G. Ding, "Detecting the borders between coding and non-coding DNA regions in prokaryotes based on recursive segmentation and nucleotide doublets statistics," BMC Genomics, vol. 13, no. Suppl 8, article ID. S19, 2012.   DOI
156 S. Mereuta and V. Munteanu, "A new information theoretic approach to exon-intron classification," in Proceedings of International Symposium on Signals, Circuits and Systems (ISSCS2007), Iasi, Romania, 2007, pp. 1-4.
157 F. S. Collins, L. D. Brooks, and A. Chakravarti, "A DNA polymorphism discovery resource for research on human genetic variation," Genome Research, vol. 8, no. 12, pp. 1229-1231, 1998.   DOI
158 M. Rho, H. Tang, and Y. Ye, "FragGeneScan: predicting genes in short and error-prone reads," Nucleic Acids Research, vol. 38, no. 20, article ID. e191, 2010.
159 D. Hyatt, G. L. Chen, P. F. LoCascio, M. L. Land, F. W. Larimer, and L. J. Hauser, "Prodigal: prokaryotic gene recognition and translation initiation site identification," BMC Bioinformatics, vol. 11, article ID. 119, 2010.
160 H. Noguchi, T. Taniguchi, and T. Itoh, "MetaGeneAnnotator: detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes," DNA Research, vol. 15, no. 6, pp. 387-396, 2008.   DOI
161 S. J. Lee, K. A. Usmani, B. Chanas, B. Ghanayem, T. Xi, E. Hodgson, H. W. Mohrenweiser, and J. A. Goldstein, "Genetic findings and functional studies of human CYP3A5 single nucleotide polymorphisms in different ethnic groups," Pharmacogenetics, vol. 13, no. 8, pp. 461-472, 2003.   DOI
162 N. Elango and S. V. Yi, "Functional relevance of CpG island length for regulation of gene expression," Genetics, vol. 187, no. 4, pp. 1077-1083, 2011.   DOI
163 P. Deininger, "Alu elements: know the SINEs," Genome Biology, vol. 12, no. 12, article ID. 236, 2011.
164 B. Hutter, V. Helms, and M. Paulsen, "Tandem repeats in the CpG islands of imprinted genes," Genomics, vol. 88, no. 3, pp. 323-332, 2006.   DOI
165 A. L. Brunner, D. S. Johnson, S. W. Kim, A. Valouev, T. E. Reddy, N. F. Neff, et al., "Distinct DNA methylation patterns characterize differentiated human embryonic stem cells and developing human fetal liver," Genome Research, vol. 19, no. 6, pp. 1044-1056, 2009.   DOI
166 A. Lomsadze, V. Ter-Hovhannisyan, Y. O. Chernoff, and M. Borodovsky, "Gene identification in novel eukaryotic genomes by self-training algorithm," Nucleic Acids Research, vol. 33, no. 20, pp. 6494-6506, 2005.   DOI
167 W. Zhu, A. Lomsadze, and M. Borodovsky, "Ab initio gene identification in metagenomic sequences," Nucleic Acids Research, vol. 38, no. 12, article ID. e132, 2010.
168 M. Borodovsky and J. McIninch, "GENMARK: parallel gene recognition for both DNA strands," Computers & Chemistry, vol. 17, no. 2, pp. 123-133, 1993.   DOI
169 C. Burge and S. Karlin, "Prediction of complete gene structures in human genomic DNA," Journal of Molecular Biology, vol. 268, no. 1, pp. 78-94, 1997.   DOI
170 D. Kulp, D. Haussler, M. G. Reese, and F. H. Eeckman, "A generalized hidden Markov model for the recognition of human genes in DNA," in Proceedings of the 4th International Conference on Intelligent Systems for Molecular Biology, St. Louis, MO, 1996, pp. 134-142.
171 L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," in Readings in Speech Recognition, A. Waibel and K. F. Lee, Eds. San Francisco, CA: Morgan Kaufmann Publishers, 1990, pp. 267-296.
172 D. Sankoff, "Efficient optimal decomposition of a sequence into disjoint regions, each matched to some template in an inventory," Mathematical Biosciences, vol. 111, no. 2, pp. 279-293, 1992.   DOI
173 A. J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Transactions on Information Theory, vol. 13, no. 2, pp. 260-269, 1967.   DOI
174 V. Ter-Hovhannisyan, A. Lomsadze, Y. O. Chernoff, and M. Borodovsky, "Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training," Genome Research, vol. 18, no. 12, pp. 1979-1990, 2008.   DOI
175 I. Wallach, M. Dzamba, and A. Heifets, "AtomNet: a deep convolutional neural network for bioactivity prediction in structure-based drug discovery," Oct. 2015; http://arxiv.org/pdf/1510.02855v1.pdf.
176 H. Wu, B. Caffo, H. A. Jaffee, R. A. Irizarry, and A. P. Feinberg, "Redefining CpG islands using hidden Markov models," Biostatistics, vol. 11, no. 3, pp. 499-514, 2010.   DOI
177 N. Yu, X. Guo, A. Zelikovsky, and Y. Pan, "GaussianCpG: a Gaussian model for detection of human CpG island," in Proceedings of IEEE 5th International Conference on Computational Advances in Bio and Medical Sciences (ICCABS), Miami, FL, 2015.
178 L. Deng and D. Yu, "Deep learning: methods and applications," May 2014; http://research.microsoft.com/apps/pubs/default.aspx?id=209355.
179 B. Ramsundar, S. Kearnes, P. Riley, D. Webster, D. Konerding, and V. Pande, "Massively multitask networks for drug discovery," Feb. 2015; http://arxiv.org/pdf/1502.02072v1.pdf.
180 D. Chicco, P. Sadowski, and P. Baldi, "Deep autoencoder neural networks for gene ontology annotation predictions," in Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (BCB'14), Washington, DC, 2014, pp. 533-540.
181 R. Raina, A. Madhavan, and A. Y. Ng, "Large-scale deep unsupervised learning using graphics processors," in Proceedings of the 26th Annual International Conference on Machine Learning (ICML'09), Montreal, QC, 2009, pp. 873-880.
182 X. Guo, Y. Meng, N. Yu, and Y. Pan, "Cloud computing for detecting high-order genome-wide epistatic interaction via dynamic clustering," BMC Bioinformatics, vol. 15, no. 1, article ID. 102, 2014.
183 E. E. Snyder and G. D. Stormo, "Identification of protein coding regions in genomic DNA," Journal of Molecular Biology, vol. 248, no. 1, pp. 1-18, 1995.   DOI
184 A. Lomsadze, P. D. Burns, and M. Borodovsky, "Integration of mapped RNA-seq reads into automatic training of eukaryotic gene finding algorithm," Nucleic Acids Research, vol. 42, no. 15, article ID. e119, 2014.
185 R. Staden, "Computer methods to locate signals in nucleic acid sequences," Nucleic Acids Research, vol. 12, no. 1 (Pt 2), pp. 505-519, 1984.   DOI
186 R. Guigo, S. Knudsen, N. Drake, and T. Smith, "Prediction of gene structure," Journal of Molecular Biology, vol. 226, no. 1, pp. 141-157, 1992.   DOI
187 M. Q. Zhang and T. G. Marr, "A weight array method for splicing signal analysis," Computer Applications in the Biosciences (CABIOS), vol. 9, no. 5, pp. 499-509, 1993.
188 J. Henderson, S. Salzberg, and K. H. Fasman, "Finding genes in DNA with a hidden Markov model," Journal of Computational Biology, vol. 4, no. 2, pp. 127-141, 1997.   DOI
189 I. Korf, P. Flicek, D. Duan, and M. R. Brent, "Integrating genomic homology into gene structure prediction," Bioinformatics, vol. 17, no. Suppl 1, pp. S140-S148, 2001.   DOI
190 J. Wu and D. Haussler, "Coding exon detection using comparative sequences," Journal of Computational Biology, vol. 13, no. 6, pp. 1148-1164, 2006.   DOI
191 W. H. Majoros, M. Pertea, and S. L. Salzberg, "TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders," Bioinformatics, vol. 20, no. 16, pp. 2878-2879, 2004.   DOI
192 E. C. Uberbacher and R. J. Mural, "Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach," Proceedings of the National Academy of Sciences, vol. 88, no. 24, pp. 11261-11265, 1991.
193 A. Motahari, G. Bresler, and D. Tse, "Information theory of DNA shotgun sequencing," IEEE Transactions on Information Theory, vol. 59, no. 10, pp. 6273-6289, 2013.   DOI
194 T. H. Chang, S. L. Wu, W. J. Wang, J. T. Horng, and C. W. Chang, "A novel approach for discovering condition-specific correlations of gene expressions within biological pathways by using cloud computing technology," BioMed Research International, vol. 2014, article ID. 18, 2014.
195 X. Guo, N. Yu, B. Li, and Y. Pan, "Cloud computing for NGS data analysis," in Computational Methods for Next Generation Sequencing Data Analysis. Hoboken, NJ: Wiley, 2016.
196 J. Yee, M. S. Kwon, T. Park, and M. Park, "A modified entropy-based approach for identifying gene-gene interactions in case-control study," PLoS ONE, vol. 8, no. 7, article ID. e69321, 2013.
197 A. Ghosh and R. K. De, "A fuzzy entropy based approach for development of gene prediction networks (GPNs): detecting altered dependency in carcinogenic state," in Proceedings of the 2nd ACM Conference on Bioinformatics, Computational Biology and Biomedicine (BCB'11), Chicago, IL, 2011, pp. 320-324.
198 L. Galleani and R. Garello, "The minimum entropy mapping spectrum of a DNA sequence," IEEE Transactions on Information Theory, vol. 56, no. 2, pp. 771-783, 2010.   DOI
199 Z. Ouyang, H. Zhu, J. Wang, and Z. S. She, "Multivariate entropy distance method for prokaryotic gene identification," Journal of Bioinformatics and Computational Biology, vol. 2, no. 2, pp. 353-373, 2004.   DOI
200 S. Zhu, D. Wang, K. Yu, T. Li, and Y. Gong, "Feature selection for gene expression using model-based entropy," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 25-36, 2010.   DOI