http://dx.doi.org/10.4051/ibc.2014.6.4.0003

An Approach for a Substitution Matrix Based on Protein Blocks and Physicochemical Properties of Amino Acids through PCA  

You, Youngki (School of Life Science, Handong Global University)
Jang, Inhwan (School of Life Science, Handong Global University)
Lee, Kyungro (Department of Biotechnology, Yonsei University)
Kim, Heonjoo (School of Life Science, Handong Global University)
Lee, Kwanhee (School of Life Science, Handong Global University)
Publication Information
Interdisciplinary Bio Central / v.6, no.4, 2014, pp. 3.1-3.10
Abstract
Amino acid substitution matrices are essential tools for protein sequence analysis, homology searches in protein databases and multiple sequence alignment. The PAM matrix was the first widely used amino acid substitution matrix, and it was later succeeded by the BLOSUM series. Most substitution matrices were derived from the statistical frequency of substitutions between amino acids within blocks that represent protein families or groups of related proteins. However, amino acid substitution ultimately depends on the similarity of the physicochemical properties of the residues involved. In this study, a new approach was used to identify the major physicochemical properties reflected in multiple sequence alignments. The frequencies of amino acid substitution in a multiple sequence alignment database were merged with selected amino acid attributes from a physicochemical properties database. Principal component analysis (PCA) of the merged data yielded four principal components that capture the major physicochemical properties. Using factor analysis, these four components were interpreted as flexibility of electronic movement, polarity, negative charge and structural flexibility. A new substitution matrix, BAPS, was constructed from these four components and validated for accuracy. In a comparison of receiver operating characteristic ($\mathrm{ROC}_{50}$) values, BAPS scored slightly lower than BLOSUM and PAM. However, when accuracy was evaluated by comparing multiple sequence alignments against the structural alignments of two test data sets with known three-dimensional structures from the homologous structure alignment database, BAPS performed comparably to or better than earlier matrices, including the PAM, Gonnet, identity and genetic code matrices.
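To make the idea concrete, the following is a minimal sketch of how a per-residue feature table could be reduced to four principal components and turned into a symmetric similarity matrix. It is not the procedure used to build BAPS, whose details are not given in the abstract. The function names (`pca_scores`, `similarity_from_scores`), the distance-to-score conversion, and the random placeholder feature table are all assumptions made for illustration; real input would be the merged substitution-frequency and physicochemical-property data.

```python
import numpy as np

AMINO_ACIDS = list("ARNDCQEGHILKMFPSTWYV")


def pca_scores(features, n_components):
    """Project rows of `features` onto the first principal components.

    Columns are standardized first, then PCA is done through an SVD.
    Returns (scores, explained_variance_ratio).
    """
    x = np.asarray(features, dtype=float)
    x = x - x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns at zero instead of dividing by 0
    x = x / std
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]


def similarity_from_scores(scores):
    """Hypothetical scoring rule: mean pairwise distance minus distance.

    Residues that lie close together in component space get a high score.
    """
    diff = scores[:, None, :] - scores[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return np.rint(dist.mean() - dist).astype(int)


if __name__ == "__main__":
    # Placeholder data only: 20 residues x 8 made-up attributes.
    rng = np.random.default_rng(0)
    features = rng.normal(size=(len(AMINO_ACIDS), 8))
    scores, explained = pca_scores(features, n_components=4)
    matrix = similarity_from_scores(scores)
    print("explained variance ratio:", np.round(explained, 3))
    print("matrix shape:", matrix.shape)
```

The Euclidean distance in component space is only one possible way to derive scores; a log-odds formulation, as used for PAM and BLOSUM, would need substitution counts rather than coordinates.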
Keywords
BAPS; factor analysis; principal component analysis; scoring matrix; sequence alignment