Sequence driven features for prediction of subcellular localization of proteins

  • Kim, Jong-Kyoung (Department of Computer Science, Pohang University of Science and Technology)
  • Bang, Sung-Yang (Department of Computer Science, Pohang University of Science and Technology)
  • Choi, Seung-Jin (Department of Computer Science, Pohang University of Science and Technology)
  • Published : 2005.09.22

Abstract

Predicting the cellular location of an unknown protein provides valuable information for inferring its possible function. A more accurate prediction system requires a good feature extraction method that transforms raw sequence data into a numerical feature vector while minimizing information loss. In this paper, we propose new methods that extract underlying features from the sequence data alone by computing pairwise sequence alignment scores. In addition, we use composition-based features to improve prediction accuracy. To construct an SVM ensemble from separately trained SVM classifiers, we propose specificity-based weighted majority voting. The overall prediction accuracy, evaluated by 5-fold cross-validation, reached 88.53% on the eukaryotic animal data set. Comparing the prediction accuracy of the various feature extraction methods also gave biological insight into the location of targeting information. Our numerical experiments confirm that the new feature extraction methods are useful for predicting the subcellular localization of proteins.
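
The abstract does not give the exact weighting formula, so the following Python sketch shows only one plausible reading of specificity-based weighted majority voting: each classifier's vote for a class is weighted by that classifier's specificity for the class, estimated on held-out data. The function names, the label strings, and the weighting rule itself are assumptions made for illustration, not the authors' implementation.

```python
from collections import defaultdict


def specificity(y_true, y_pred, label):
    """Specificity for one class: TN / (TN + FP), with `label` as the positive class."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != label and p != label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    return tn / (tn + fp) if (tn + fp) else 0.0


def weighted_vote(predictions, weights):
    """Combine one predicted label per classifier into a single label.

    predictions[i] is the label chosen by classifier i.
    weights[i][label] is the (assumed) specificity of classifier i for that label.
    Ties are broken by alphabetical order of the label.
    """
    scores = defaultdict(float)
    for i, label in enumerate(predictions):
        scores[label] += weights[i].get(label, 0.0)
    return max(sorted(scores), key=scores.get)


# Hypothetical held-out labels and predictions for one classifier.
y_true = ["nuc", "cyt", "nuc", "cyt"]
y_pred = ["nuc", "nuc", "nuc", "cyt"]
spec_nuc = specificity(y_true, y_pred, "nuc")
spec_cyt = specificity(y_true, y_pred, "cyt")

# Three classifiers vote; two weak "nuc" votes lose to one strong "cyt" vote.
winner = weighted_vote(
    ["nuc", "cyt", "nuc"],
    [{"nuc": 0.5}, {"cyt": 0.9}, {"nuc": 0.3}],
)
```

Under this reading, a classifier that rarely assigns a class by mistake carries more weight when it does vote for that class, so a plain majority can be overruled by a single high-specificity vote.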

Keywords