The Grammatical Structure of Protein Sequences

  • Published : 2000.11.01

Abstract

We describe a hidden Markov model, HMMTIR, for general protein sequence based on the I-sites library of sequence-structure motifs. Unlike the linear HMMs used to model individual protein families, HMMSTR has a highly branched topology and captures recurrent local features of protein sequences and structures that transcend protein family boundaries. The model extends the I-sites library by describing the adjacencies of different sequence-structure motifs as observed in the database, and achieves a great reduction in parameters by representing overlapping motifs in a much more compact form. The HMM attributes a considerably higher probability to coding sequence than does an equivalent dipeptide model, predicts secondary structure with an accuracy of 74.6% and backbone torsion angles better than any previously reported method, and predicts the structural context of beta strands and turns with an accuracy that should be useful for tertiary structure prediction. HMMSTR has been incorporated into a public, fully-automated protein structure prediction server.

Keywords