Computational Approaches to Gene Prediction

  • Do Jin-Hwan (Bio-food and Drug Research Center, Konkuk University)
  • Choi Dong-Kug (Department of Biotechnology, Konkuk University)
  • Published : 2006.04.01

Abstract

Gene identification and the prediction of gene structure in DNA sequences have attracted increasing attention over the past few years, as large-scale sequencing projects have generated an immense amount of genome data. A variety of prediction programs have been developed to address these problems. This paper reviews the computational approaches and gene-finders commonly used for gene prediction in eukaryotic genomes. In general, two approaches have been adopted for this purpose: similarity-based and ab initio techniques. The information gleaned from these methods is then combined via a variety of algorithms, including Dynamic Programming (DP) and the Hidden Markov Model (HMM), and used for gene prediction from genomic sequences.
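As a rough illustration of how DP and an HMM fit together, the sketch below decodes a toy two-state HMM (coding vs. noncoding) with the Viterbi algorithm, which is itself a dynamic program. The state names, the probabilities in START, TRANS, and EMIT, and the function name viterbi are assumptions made up for this example. They are not taken from the paper or from any published gene-finder, and real programs use far richer state sets (exons, introns, splice signals) with trained parameters.

```python
import math

STATES = ("coding", "noncoding")

# Hypothetical parameters chosen only for illustration (not trained values).
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},     # assumed GC-rich
    "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},  # assumed AT-rich
}


def viterbi(seq):
    """Return the most probable state path for seq under the toy HMM."""
    if not seq:
        return []
    # score[s]: best log-probability of any path that ends in state s.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s]: best predecessor of state s at position i + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # DP recurrence: extend the best-scoring predecessor into state s.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

Calling viterbi on a sequence returns one label per base. Working in log space avoids numerical underflow on long sequences, which matters once the input is a genomic region rather than a short string.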
