Protein Disorder Prediction Using Multilayer Perceptrons

  • Oh, Sang-Hoon (Department of Information Communication Engineering, Mokwon University)
  • Received : 2013.08.16
  • Accepted : 2013.12.06
  • Published : 2013.12.28

Abstract

"Protein Folding Problem" is considered to be one of the "Great Challenges of Computer Science" and prediction of disordered protein is an important part of the protein folding problem. Machine learning models can predict the disordered structure of protein based on its characteristic of "learning from examples". Among many machine learning models, we investigate the possibility of multilayer perceptron (MLP) as the predictor of protein disorder. The investigation includes a single hidden layer MLP, multi hidden layer MLP and the hierarchical structure of MLP. Also, the target node cost function which deals with imbalanced data is used as training criteria of MLPs. Based on the investigation results, we insist that MLP should have deep architectures for performance improvement of protein disorder prediction.

1. INTRODUCTION

Proteins carry out many important functions indispensable for life, and the study of protein structure is important for our understanding of many biological processes [1]. When a protein is in its functional state, it is called native. The native form of a protein is assumed to have a specific 3D structure, and the loss of function is assumed to be associated with unfolding or loss of that specific 3D structure. Protein structure can be determined by X-ray diffraction, NMR, homology alignment, and other methods [2]-[4].

The information flow from amino acid sequence to 3D structure is very important, and this “protein folding problem” is considered to be one of the “Great Challenges of Computer Science” [5], [6]. The protein folding problem includes the prediction of order and disorder. A protein region is defined as disordered if it is devoid of stable secondary structure [7], [8]. Recognition of disordered regions in a protein is important for two reasons: it reduces bias in sequence similarity analysis by avoiding the alignment of disordered regions against ordered ones, and it helps delineate the boundaries of protein domains to guide structural and functional studies [7]. Accurate recognition of disordered regions can be applied to enzyme specificity studies, function recognition, and drug design [4]. However, there are several categories of disorder, such as molten globules, partially unstructured proteins, and random coils [7], and no commonly agreed definition of protein disorder exists [2].

Intrinsically disordered proteins generally have a biased amino acid composition [7]. G, S, and P are disorder-promoting amino acids. W, F, I, Y, V and L are order-promoting amino acids, while H and T are considered neutral with respect to disorder. However, using sequence composition as the sole predictive parameter of disorder is not reliable [7].

Disordered regions can be indirectly predicted by experimental methods such as X-ray crystallography; NMR, Raman, and CD spectroscopy; and hydrodynamic measurements [2]. Each of these methods detects different aspects of disorder, resulting in several operational definitions of protein disorder.

Alternatively, machine learning approaches to determine whether regions are disordered have been proposed. Romero et al. proposed the PONDR method, which constructed feature extraction data through p-feature selection and PCA (principal component analysis) and then trained an MLP (multilayer perceptron) using the EBP (error back-propagation) algorithm [1]. In the PONDR method, general predictors are trained using all available disordered examples, while family-specific predictors are trained to predict a particular type of disorder. A hybrid predictor combines family-specific predictors into a more general disorder-predicting system by using an arbiter neural network's decision when the base predictors disagree. However, there is a severe imbalance between disordered and ordered regions, and the PONDR method used an artificial procedure to make the data balanced.

Yang and Thomson proposed the BBFNN (bio-basis function neural network), which resembles the GPFN (Gaussian potential function network) [3]. In this method, the bio-basis function was designed based on homology alignment scores, and the weights of the final layer were calculated with the pseudo-inverse method. They also proposed RONN in order to handle the variable lengths of disordered/ordered regions [4]. RONN is weak particularly in the detection of short regions of disorder and in defining the first and last residues of disordered regions.

Linding et al. proposed DisEMBL, which consists of three neural networks, each of which detects a separately defined category of disorder: loops/coils, hot loops, and missing coordinates in X-ray structures [2]. Possibly because of the small number of positive (disordered region) samples, Linding et al. found that networks with many hidden nodes performed no better than those with few. Hence, they used only five hidden nodes, but they did not consider the imbalance of the data when training their neural networks [2].

Data imbalance is reported in a wide range of applications such as bio-medical diagnosis [9], gene ontology [10], remote sensing [11], credit assessment [12], etc. Classifiers developed under the assumption of balanced class priors show poor performance on imbalanced data problems, including protein disorder prediction.

When dealing with the protein disorder prediction problem in this paper, we take the imbalance of the data into account when training MLPs. Also, we investigate which MLP architectures suit the protein disorder prediction problem. In Section 2, we briefly introduce the EBP algorithm for MLPs and the target node method, which handles imbalanced data within the EBP scheme. In Section 3, we propose several MLP architectures for the protein disorder prediction problem and show simulation results. Finally, Section 4 concludes the paper.

 

2. ERROR BACK-PROPAGATION ALGORITHM AND IMBALANCED DATA

Among many supervised learning models in the machine learning field, we select MLP as a predictor of disordered proteins because of its arbitrary function approximation capability [13].

Fig. 1. The architecture of a multilayer perceptron.

Consider an MLP consisting of N inputs, H hidden nodes, and M output nodes, which is denoted as an “N-H-M MLP”. When a sample $\mathbf{x}(p) = [x_1(p), x_2(p), \ldots, x_N(p)]$ $(p = 1, 2, \ldots, P)$ is presented to the MLP, by forward propagation, the j-th hidden node value is given by

$$h_j(p) = \tanh\!\Big(\sum_{i=1}^{N} w_{ji}\, x_i(p) + w_{j0}\Big), \quad j = 1, 2, \ldots, H.$$

Here, $w_{ji}$ denotes the weight connecting $x_i$ to $h_j$ and $w_{j0}$ is a bias. The k-th output node value is

$$y_k(p) = \tanh\!\big(\hat{y}_k(p)\big), \quad k = 1, 2, \ldots, M,$$

where

$$\hat{y}_k(p) = \sum_{j=1}^{H} v_{kj}\, h_j(p) + v_{k0}.$$

Also, $v_{k0}$ is a bias and $v_{kj}$ denotes the weight connecting $h_j$ to $y_k$.

Let the desired output vector corresponding to the training sample $\mathbf{x}(p)$ be $\mathbf{t}(p) = [t_1(p), t_2(p), \ldots, t_M(p)]$, which is coded as follows:

$$t_k(p) = \begin{cases} 1, & \text{if } \mathbf{x}(p) \in C_k, \\ -1, & \text{otherwise}. \end{cases}$$

As a distance measure between the actual and desired outputs, we usually use the squared error function for the P training samples, defined by

$$E = \frac{1}{2} \sum_{p=1}^{P} \sum_{k=1}^{M} \big(t_k(p) - y_k(p)\big)^2.$$

To minimize E, the weights $v_{kj}$ are iteratively updated by

$$v_{kj} \leftarrow v_{kj} + \eta\, \delta_k(p)\, h_j(p),$$

where

$$\delta_k(p) = \big(t_k(p) - y_k(p)\big)\big(1 - y_k^2(p)\big)$$

is the error signal and $\eta$ is the learning rate. Also, by the backward propagation of the error signal, the weights $w_{ji}$ are updated by

$$w_{ji} \leftarrow w_{ji} + \eta \Big(\sum_{k=1}^{M} \delta_k(p)\, v_{kj}\Big)\big(1 - h_j^2(p)\big)\, x_i(p).$$

The above weight-updating procedure is the EBP algorithm [14], which does not consider any imbalance among classes.

In the protein disorder prediction problem, the positive (disordered region) samples are much fewer than the negative (ordered region) samples. This imbalance severely degrades the performance of protein disorder prediction. To resolve the imbalance, Romero et al. adopted an artificial procedure to make the data balanced [1]. Linding et al. reported that an MLP with few hidden nodes is better than one with many hidden nodes [2], but this strategy degrades the approximation capability of MLPs [13]. In contrast to these artificial or heuristic methods, a better way is an algorithmic approach, which was proposed to strengthen learning with regard to the positive samples [16].

Consider a two-class problem with an imbalanced data set [15]. Assume that one class is the minority class C1 with P1 training samples and the other is the majority class C2 with P2 training samples (P1 ≪ P2). When an MLP is trained on such data with the standard EBP algorithm, the weight updates are dominated by the majority class, and the decision boundary is distorted toward the minority class, degrading the minority class accuracy.

In order to prevent the boundary distortion, the target node method was proposed, whose error function is defined by [16]

$$E = \sum_{p=1}^{P} \left[ \frac{1}{n} \big| t_1(p) - y_1(p) \big|^{n} + \frac{1}{m} \big| t_2(p) - y_2(p) \big|^{m} \right],$$

where n and m (n < m) are the error powers applied to the target node of the minority class (k = 1) and the target node of the majority class (k = 2), respectively.

The parameters n and m control the amount of weight updating according to whether the target node is for the minority or the majority class. Since n < m, the error signals generated through the minority class target node are relatively emphasized compared with those through the majority class target node, which strengthens learning with regard to the minority class samples and prevents the boundary distortion [16], [18].

Also, in order to fix the imbalance of the numbers of targets ‘1’ and ‘-1’, the error signals δk(p) are regulated as γδk(p), with the parameter γ = P1 / P2, in the case that (k = 1 and tk(p) = −1) or (k = 2 and tk(p) = 1) [16]. Then, the associated weights are updated in proportion to the error signals, following the same procedure as in the EBP algorithm [14].

 

3. MLPs FOR PROTEIN DISORDER PREDICTION

In this section, we train MLPs as protein disorder predictors. The learning algorithm is the target node method, which shows better performance on imbalanced data problems [16]. Still, there are many possibilities for the architecture of MLPs, and we try to find a better architecture for protein disorder prediction.

The protein disorder prediction database was supplied by KIAS (Korea Institute for Advanced Study). A total of 215,612 feature vectors were extracted from 723 proteins with a window size of 15. Each feature vector consists of a 330-dimensional sequence profile, a 45-dimensional secondary structure profile, a 16-dimensional solvent accessibility profile, and a 17-dimensional hydrophobicity profile; thus, each feature vector is 408-dimensional in total.

First, we simulated protein disorder prediction with a single-hidden-layer MLP of 408 inputs, 20 hidden nodes, and 2 output nodes. Since the protein disorder data is imbalanced, we used the target node method of Section 2 with n = 2 and m = 8 to train the MLPs for 5000 epochs. Nine simulations were conducted, with the MLP weights initialized uniformly on [−1×10−4, 1×10−4]; this initialization range avoids the premature saturation phenomenon of learning [19]. In each simulation, we performed five-fold cross-validation for performance evaluation. When data is imbalanced, the total accuracy depends heavily on the accuracy of the majority class, so total accuracy is not an adequate performance measure. Accordingly, as performance criteria, we used the accuracy of the minority (disordered region) class and the geometric mean of the majority (ordered region) class accuracy and the minority class accuracy [9]. The forty-five results from the nine weight initializations and five cross-validation folds were averaged, and the best performance attained during the 5000 training epochs is given in Table 1. The accuracy of the disordered region class and the geometric mean for training samples are 91% and 89.4%, respectively. For validation samples, the accuracy of the disordered region class and the geometric mean are 79.03% and 80.46%, respectively.

Table 1. The simulation results for the 408-20-2 MLP. 408 is the number of input nodes, 20 is the number of hidden nodes, and 2 is the number of output nodes. G-Mean denotes the geometric mean.

Alternatively, we tried a hierarchical architecture of MLPs. Since each input vector consists of four profiles, we allocate an MLPi (i = 1, 2, 3, 4) to each profile of the input vector. That is, MLP1 is for the 330-dimensional sequence profile, MLP2 for the 45-dimensional secondary structure profile, MLP3 for the 16-dimensional solvent accessibility profile, and MLP4 for the 17-dimensional hydrophobicity profile. Each MLPi (i = 1, 2, 3, 4) has 20 hidden nodes and two output nodes. The eight output node values from the MLPi (i = 1, 2, 3, 4) are presented to a judge MLP, which integrates the classification information of the MLPi and makes the final decision, as sketched below. The architecture of the judge MLP is 8-20-2. The initialization and training methods are the same as for the single-hidden-layer MLP. The performance, evaluated by averaging the forty-five simulation results, is shown in Table 2.

Since each profile has different characteristics, the MLPi (i = 1, 2, 3, 4) show different performances. Among them, MLP1 and MLP3 are better, and MLP4 is the worst. This suggests that the hydrophobicity profile is more complex than the other profiles. After integrating the information from the MLPi (i = 1, 2, 3, 4), the judge MLP improves the performance. However, the performance of the judge MLP on validation samples is slightly inferior to that of the single-hidden-layer MLP.

Table 2. The simulation results for the hierarchical MLPs. The 330-20-2 MLP1 is for the sequence profile, the 45-20-2 MLP2 is for the secondary structure profile, the 16-20-2 MLP3 is for the solvent accessibility profile, and the 17-20-2 MLP4 is for the hydrophobicity profile. The 8-20-2 judge MLP is for the final decision.

The performance of an MLP depends on the number of hidden nodes as well as the number of hidden layers. In contrast to the first and second simulations, which used MLPs with a single hidden layer, we increased the number of hidden layers from one to three. Here, “N-H1-H2-H3-M MLP” denotes an MLP with H1, H2, and H3 nodes in the first, second, and third hidden layers, respectively.

We tried MLP architectures of 408-2-2-2-2, 408-4-4-4-2, and 408-20-20-20-2. The initialization and training methods are the same as in the first simulation, and the performances, evaluated as averages of the forty-five simulation results, are shown in Table 3. Comparing Table 3 with Table 1, we find that the three-hidden-layer MLP with 20 nodes in each hidden layer attains better performance on training samples and similar performance on validation samples. This is due to specialization to the training samples caused by the increased number of hidden nodes; the (c) case in Table 3 has 20 × 3 = 60 hidden nodes.

As a final trial, we simulated the hierarchical architecture of MLPs with three hidden layers. Here, each MLPi (i = 1, 2, 3, 4) has three hidden layers, and the judge MLP also has three hidden layers, with 20 nodes in each hidden layer. The initialization and training methods are the same as in the first simulation. As in the previous simulations, the performances evaluated by averaging the forty-five simulation results are given in Table 4.

Table 3. The simulation results for the three-hidden-layer MLPs, whose architectures are (a) 408-2-2-2-2 (two nodes in each hidden layer), (b) 408-4-4-4-2 (four nodes in each hidden layer), and (c) 408-20-20-20-2 (twenty nodes in each hidden layer).

Comparing Tables 2 and 4, the performance of the judge MLPs is similar. However, increasing the number of hidden layers greatly improved the disorder class accuracy of MLP4, and the (c) case in Table 3 shows a similar tendency. Thus, there is a possibility of performance improvement by increasing the number of hidden layers. Although we can also improve performance by increasing the number of hidden nodes, this causes specialization of learning to the training samples and, finally, degradation of performance on test samples. Therefore, we pursue increasing the number of hidden layers. This argument coincides with the neural network community's increasing interest in deep belief networks [20], [21].

Table 4. The simulation results for the hierarchical MLPs with three hidden layers. MLP1 is 330-20-20-20-2, MLP2 is 45-20-20-20-2, MLP3 is 16-20-20-20-2, MLP4 is 17-20-20-20-2, and the judge MLP for the final decision is 8-20-20-20-2.

The poor performance of an MLP is due either to specialization to the training samples or, in some cases, to the inability of the MLP to fit the true function described by the training samples. As a strategy to resolve these problems, the deep architecture of MLPs has been proposed [21]. Since a deep belief network has many hidden layers, it is very difficult to train successfully. As an initialization methodology for successful training, the RBM (Restricted Boltzmann Machine) was proposed [20]. We tried various architectures of MLPs, and our main finding is that increasing the number of hidden layers can improve performance. Therefore, we will adopt the deep architecture of MLPs initialized with RBMs as the next approach to protein disorder prediction.

 

4. CONCLUSIONS

In this paper, we investigated the possibilities of the MLP as a machine learning methodology for protein disorder prediction. A single-hidden-layer MLP, hierarchical MLPs, a three-hidden-layer MLP, and hierarchical MLPs with three hidden layers were simulated. In contrast to other approaches, we trained the MLPs with the target node method, which can deal with imbalanced data problems. Since the protein disorder data is heavily imbalanced, MLPs must be trained with a learning algorithm developed for imbalanced data.

According to the simulation results, it was very difficult to improve the performance of protein disorder prediction. Nevertheless, there was an indication that increasing the number of hidden layers can improve the performance of protein disorder prediction. This coincides with the neural network community's strong interest in deep architectures. As a next step, we will try a deep belief network initialized with RBMs to improve the performance of protein disorder prediction.

References

  1. P. Romero, Z. Obradovic, and A. K. Dunker, "Intelligent data analysis for protein disorder prediction," Artificial Intelligence Review, vol. 14, 2000, pp. 447-484. https://doi.org/10.1023/A:1006678623815
  2. R. Linding, L. J. Jensen, F. Diella, P. Bork, T. J. Gibson, and R. B. Russell, "Protein disorder prediction: Implications for structural proteomics," Structure, vol. 11, 2003, pp. 1453-1459. https://doi.org/10.1016/j.str.2003.10.002
  3. Z. R. Yang and R. Thomson, "Bio-basis function neural network for prediction of protease cleavage sites in proteins," IEEE Trans. Neural Networks, vol. 16, 2005, pp. 263-274. https://doi.org/10.1109/TNN.2004.836196
  4. Z. R. Yang, R. Thomson, P. McNeil, and R. M. Esnouf, "RONN: the bio-basis function neural network technique applied to the detection of natively disordered regions in proteins," Bioinformatics, vol. 21, 2005, pp. 3369-3376. https://doi.org/10.1093/bioinformatics/bti534
  5. FCCST, Grand Challenges 1993: High performance computing and communications, A report by the committee on physical, mathematical, and engineering sciences, Federal coordinating council for science and technology.
  6. O. Noivirt-Brik, J. Prilusky, and J. L. Sussman, "Assessment of disorder predictions in CASP8," Proteins, vol. 77, 2009, pp. 210-216. https://doi.org/10.1002/prot.22586
  7. F. Ferron, S. Longhi, B. Canard, and D. Karlin, "A practical overview of protein disorder prediction methods," PROTEINS: Structure, Function, and Bioinformatics, vol. 65, 2006, pp. 1-14. https://doi.org/10.1002/prot.21075
  8. B. He, K. Wang, Y. Liu, B. Xue, V. N. Uversky, and A. K. Dunker, "Predicting intrinsic disorder in proteins: an overview," Cell Research, vol. 19, 2009, pp. 929-949. https://doi.org/10.1038/cr.2009.87
  9. P. Kang and S. Cho, "EUS SVMs: ensemble of undersampled SVMs for data imbalance problem," Proc. ICONIP'06, 2006, pp. 837-846.
  10. R. Bi, Y. Zhou, F. Lu, and W. Wang, "Predicting gene ontology functions based on support vector machines and statistical significance estimation," Neurocomputing, vol. 70, 2007, pp. 718-725. https://doi.org/10.1016/j.neucom.2006.10.006
  11. L. Bruzzone, and S. B. Serpico, "Classification of Remote-Sensing Data by Neural Networks," Pattern Recognition Letters, vol. 18, 1997, pp. 1323-1328. https://doi.org/10.1016/S0167-8655(97)00109-8
  12. Y. M. Huang, C. M. Hung, and H. C. Jiau, "Evaluation of Neural Networks and Data Mining Methods on a Credit Assessment Task for Class Imbalance Problem," Nonlinear Analysis, vol. 7, 2006, pp. 720-747. https://doi.org/10.1016/j.nonrwa.2005.04.006
  13. K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, 1989, pp. 359-366. https://doi.org/10.1016/0893-6080(89)90020-8
  14. D. E. Rumelhart and J. L. McClelland, Parallel Distributed Processing, MIT Press, Cambridge, MA, 1986.
  15. N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic Minority Over-sampling Technique," J. Artificial Intelligence Research, vol. 16, 2002, pp. 321-357.
  16. S. H. Oh, "Error back-propagation algorithm for classification of imbalanced data", Neurocomputing, vol. 74, 2011, pp. 1058-1061. https://doi.org/10.1016/j.neucom.2010.11.024
  17. S. H. Oh, "Improving the Error Back-Propagation Algorithm with a Modified Error Function," IEEE Trans. Neural Networks, vol. 8, 1997, pp. 799-803. https://doi.org/10.1109/72.572117
  18. S. H. Oh, "A Statistical Perspective of Neural Networks for Imbalanced Data Problems," Int. Journal of Contents, vol. 7, no. 3, 2011, pp. 1-5.
  19. Y. Lee, S. H. Oh, and M. W. Kim, "An Analysis of Premature Saturation in Back-Propagation Learning," Neural Networks, vol. 6, 1993, pp. 719-728. https://doi.org/10.1016/S0893-6080(05)80116-9
  20. G. E. Hinton and R. R. Salakhutdinov, "Reducing the Dimensionality of Data with Neural Networks," Science, vol. 313, 2006, pp. 504-507. https://doi.org/10.1126/science.1127647
  21. Y. Bengio, "Learning Deep Architectures for AI," Foundations and Trends in Machine Learning, vol. 2, 2009, pp. 1-127. https://doi.org/10.1561/2200000006