Composite Dependency-reflecting Model for Core Promoter Recognition in Vertebrate Genomic DNA Sequences

  • Kim, Ki-Bong (Department of Bioinformatics Engineering, Sangmyung University)
  • Park, Seon-Hee (Electronics and Telecommunications Research Institute)
  • Published : 2004.11.30

Abstract

This paper presents a predictive probabilistic model, the composite dependency-reflecting model (CDRM), designed to detect core promoter regions and transcription start sites (TSSs) in vertebrate genomic DNA sequences, a task of importance for genome annotation. The model combines first-, second-, third-, and much higher-order or long-range dependencies obtained with the expanded maximal dependency decomposition (EMDD) procedure, which iteratively decomposes data sets into subsets according to the degree and pattern of dependency inherent in the target promoter region. Each decomposed subset is then modeled with a first-order Markov model, so that the predictive model explicitly reflects dependencies between adjacent positions. In this way, the CDRM accommodates potentially complex dependencies between positions in the core promoter region. Such dependencies may be closely related to biological and structural context, since promoter elements occur in various combinations separated by various distances in the sequence. The CDRM may therefore be appropriate for recognizing core promoter regions and TSSs in vertebrate genomic contigs. To demonstrate the effectiveness of the algorithm, we tested it on standardized data and real core promoters and compared it with several current representative promoter-finding algorithms. The algorithm showed better sensitivity and specificity than the promoter-finding algorithms used in the comparison.
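The two ingredients named in the abstract, a dependency-driven choice of split position and a first-order Markov model per subset, can be illustrated with a minimal sketch. This is not the authors' EMDD procedure or the CDRM itself: it assumes a plain chi-square statistic as the dependency measure, a single split step, and a position-specific first-order Markov model with add-one pseudocounts. All function names and the toy sequences are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def chi_square(seqs, i, j):
    """Chi-square statistic for dependence between aligned positions i and j."""
    n = len(seqs)
    pair = Counter((s[i], s[j]) for s in seqs)
    count_i = Counter(s[i] for s in seqs)
    count_j = Counter(s[j] for s in seqs)
    stat = 0.0
    for a in ALPHABET:
        for b in ALPHABET:
            expected = count_i[a] * count_j[b] / n
            if expected > 0:
                stat += (pair[(a, b)] - expected) ** 2 / expected
    return stat

def most_dependent_position(seqs):
    """Position whose summed chi-square against all other positions is largest.

    Assumption: an MDD-style procedure would split the data set on this
    position and recurse on the resulting subsets.
    """
    length = len(seqs[0])
    totals = [
        sum(chi_square(seqs, i, j) for j in range(length) if j != i)
        for i in range(length)
    ]
    return max(range(length), key=totals.__getitem__)

def train_first_order(seqs, pseudo=1.0):
    """Estimate P(x_0) and position-specific P(x_k | x_{k-1}) with pseudocounts."""
    length = len(seqs[0])
    init = {a: pseudo for a in ALPHABET}
    trans = [
        {a: {b: pseudo for b in ALPHABET} for a in ALPHABET}
        for _ in range(length - 1)
    ]
    for s in seqs:
        init[s[0]] += 1
        for k in range(1, length):
            trans[k - 1][s[k - 1]][s[k]] += 1
    total = sum(init.values())
    init = {a: c / total for a, c in init.items()}
    for table in trans:
        for a in ALPHABET:
            row_total = sum(table[a].values())
            for b in ALPHABET:
                table[a][b] /= row_total
    return init, trans

def log_likelihood(model, seq):
    """Log-probability of seq under the first-order Markov model."""
    init, trans = model
    ll = math.log(init[seq[0]])
    for k in range(1, len(seq)):
        ll += math.log(trans[k - 1][seq[k - 1]][seq[k]])
    return ll

if __name__ == "__main__":
    # Toy aligned sites (hypothetical): positions 0 and 2 always agree.
    sites = ["ACA", "ACA", "GCG", "GCG", "ATA", "GTG"]
    split = most_dependent_position(sites)
    model = train_first_order(sites)
    print("split position:", split)
    print("log-likelihood of ACA:", log_likelihood(model, "ACA"))
```

In a fuller decomposition, the sequences would be partitioned by the base observed at the chosen position, and a separate Markov model would be trained on each subset; a candidate window would then be scored by the model of the subset it falls into.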

Cited by

  1. PromoterWizard: An Integrated Promoter Prediction Program Using Hybrid Methods vol.9, pp.4, 2011, https://doi.org/10.5808/GI.2011.9.4.194