An Approach for a Substitution Matrix Based on Protein Blocks and Physicochemical Properties of Amino Acids through PCA

  • You, Youngki (School of Life Science, Handong Global University) ;
  • Jang, Inhwan (School of Life Science, Handong Global University) ;
  • Lee, Kyungro (Department of Biotechnology, Yonsei University) ;
  • Kim, Heonjoo (School of Life Science, Handong Global University) ;
  • Lee, Kwanhee (School of Life Science, Handong Global University)
  • Received : 2014.02.25
  • Accepted : 2014.08.29
  • Published : 2014.12.31


Amino acid substitution matrices are essential tools for protein sequence analysis, homologous sequence searches in protein databases, and multiple sequence alignment. The PAM matrix was the first widely used amino acid substitution matrix, and the BLOSUM series later succeeded it. Most substitution matrices were developed from the statistical frequency of substitution between amino acids within blocks that represent groups of protein families or related proteins. However, amino acid substitution depends on the similarity of the physicochemical properties of the amino acids involved. In this study, a new approach was used to identify the major physicochemical properties reflected in multiple sequence alignments. The frequencies of amino acid substitution in a multiple sequence alignment database were merged with selected amino acid attributes from a physicochemical property database. Principal component analysis of the merged data revealed the major physicochemical properties. Using factor analysis, four principal components were interpreted as flexibility of electronic movement, polarity, negative charge, and structural flexibility. From these four components, the BAPS matrix was constructed and validated for accuracy. In a comparison of receiver operating characteristic ($ROC_{50}$) values, BAPS scored slightly lower than BLOSUM and PAM. However, when accuracy was evaluated by comparing multiple sequence alignments with structural alignments for two test data sets of known three-dimensional structure from the homologous structure alignment database, BAPS performed comparably to or better than prior matrices, including the PAM, Gonnet, Identity, and Genetic code matrices.
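The general idea of merging substitution frequencies with physicochemical attributes and reducing them by principal component analysis can be sketched as follows. This is only an illustrative sketch, not the authors' procedure: the input data are random placeholders, the function names `pca_scores` and `distance_to_scores` are hypothetical, and the rule that converts distances in component space into integer scores is an assumption made here for demonstration, since the abstract does not state how BAPS scores were derived.

```python
import numpy as np

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def pca_scores(features, n_components):
    """Project the rows of `features` onto the top principal components."""
    # Standardize each column so properties on different scales are comparable.
    centered = features - features.mean(axis=0)
    std = centered.std(axis=0)
    std[std == 0] = 1.0
    z = centered / std
    # SVD of the standardized matrix: the rows of vt are the principal axes.
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)
    return z @ vt[:n_components].T, explained[:n_components]

def distance_to_scores(components, scale=10):
    """Map Euclidean distances in component space to integer scores.

    Assumed rule for illustration only: residues that lie closer together
    than the mean off-diagonal distance receive positive scores.
    """
    diff = components[:, None, :] - components[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    off_diag = dist[~np.eye(len(dist), dtype=bool)]
    similarity = off_diag.mean() - dist
    return np.rint(scale * similarity / dist.max()).astype(int)

# Placeholder data (NOT real measurements): 20 amino acids described by
# 20 substitution-frequency columns and 6 physicochemical property columns.
rng = np.random.default_rng(0)
substitution_freq = rng.random((20, 20))
properties = rng.random((20, 6))
merged = np.hstack([substitution_freq, properties])

# Keep four components, matching the number interpreted in the abstract.
components, explained = pca_scores(merged, n_components=4)
matrix = distance_to_scores(components)

# Look up the illustrative score for one residue pair.
index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
print("L/I score:", matrix[index["L"], index["I"]])
```

With real inputs, `substitution_freq` would come from an alignment database and `properties` from a curated property index; the resulting matrix would then still need the kind of validation described above before being used for alignment.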



