Recent Progresses in the Linguistic Modeling of Biological Sequences Based on Formal Language Theory

Park, Hyun-Seok; Galbadrakh, Bulgan; Kim, Young-Mi

doi:10.5808/GI.2011.9.1.005

Genomics & Informatics, Volume 9, Issue 1, Pages 5-11, 2011. pISSN 1598-866X; eISSN 2234-0742

Korea Genome Organization (한국유전체학회)

Park, Hyun-Seok (Bioinformatics Laboratory, School of Engineering, Ewha Womans University);
Galbadrakh, Bulgan (Bioinformatics Laboratory, School of Engineering, Ewha Womans University);
Kim, Young-Mi (Natural Language Processing Laboratory, School of Natural Science, Huree Institute of Information and Communication Technology)

Accepted : 2011.03.02
Published : 2011.03.31

Abstract

Treating genomes as languages raises the possibility of producing concise generalizations about the information in biological sequences. Grammars used in this way would constitute a model of the underlying biological processes or structures and may, in fact, serve as an appropriate tool for theory formation. The growing number of available biological sequences further highlights the need for grammatical systems in bioinformatics. This review therefore lists bibliographic references on recent progress in the grammatical modeling of biological sequences. It also briefly introduces the basics of formal language theory, such as the Chomsky hierarchy, for readers who are not experts in computational linguistics, and provides pointers for a deeper investigation into the field.
