A Performance Comparison of Multi-Label Classification Methods for Protein Subcellular Localization Prediction

  • Chi, Sang-Mun (School of Computer Science and Engineering, Kyungsung University)
  • Received : 2014.02.25
  • Accepted : 2014.04.07
  • Published : 2014.04.30

Abstract

This paper presents an extensive experimental comparison of multi-label learning methods for accurately predicting the subcellular localization of proteins that exist simultaneously at multiple subcellular locations. We compare several methods from three categories of multi-label classification algorithms: algorithm adaptation, problem transformation, and meta learning. The experimental results are analyzed using 12 multi-label evaluation measures to assess the behavior of the methods from a variety of viewpoints, and a new summarization measure is used to identify the best-performing method. The best-performing methods are the power-set method that prunes infrequently occurring label subsets and classifier chains, which model relevant labels as additional features. Furthermore, ensembles of many classifiers built with these methods improve performance further. The study suggests that the correlation among subcellular locations is an effective clue for classification, because the subcellular locations of proteins performing a given biological function are not independent but correlated.

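The two problem-transformation ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the toy features, the location names, the `min_count` threshold, and the helper names `pruned_power_set` and `chain_features` are all hypothetical. The pruning step here simply discards examples whose label set is rare, which is a simplification of pruned-set methods, and no classifier is trained; the code only shows how the training data would be reshaped.

```python
from collections import Counter

# Toy data (hypothetical): feature vectors and the set of locations per protein.
X = [[0.1, 0.9], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8], [0.5, 0.5]]
Y = [
    {"nucleus"},
    {"cytoplasm", "nucleus"},
    {"cytoplasm", "nucleus"},
    {"nucleus"},
    {"cytoplasm", "membrane", "nucleus"},
]


def pruned_power_set(X, Y, min_count=2):
    """Label power-set transform: each distinct label set becomes one class.

    Examples whose label set occurs fewer than `min_count` times are dropped
    (a simplified form of pruning; assumption, not the paper's exact rule).
    """
    counts = Counter(frozenset(y) for y in Y)
    kept = [
        (x, frozenset(y))
        for x, y in zip(X, Y)
        if counts[frozenset(y)] >= min_count
    ]
    return [x for x, _ in kept], [y for _, y in kept]


def chain_features(X, Y, label_order):
    """Classifier-chain transform: one binary task per label.

    The task for the label at position `pos` sees the original features plus
    0/1 indicators for every label that comes earlier in `label_order`.
    """
    tasks = {}
    for pos, label in enumerate(label_order):
        rows = []
        for x, y in zip(X, Y):
            earlier = [1.0 if prev in y else 0.0 for prev in label_order[:pos]]
            rows.append((list(x) + earlier, 1 if label in y else 0))
        tasks[label] = rows
    return tasks


X_ps, y_ps = pruned_power_set(X, Y, min_count=2)
tasks = chain_features(X, Y, ["nucleus", "cytoplasm", "membrane"])
```

In this sketch the single protein labeled with all three locations is removed by pruning, leaving two multi-class targets, while the chain task for `membrane` receives the `nucleus` and `cytoplasm` indicators as extra features. Both transforms expose label co-occurrence to the learner, which is the property the abstract credits for the performance gain.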

Cited by

  1. Prediction of Protein Subcellular Localization using Label Power-set Classification and Multi-class Probability Estimates, vol.18, pp.10, 2014, https://doi.org/10.6109/jkiice.2014.18.10.2562
  2. Prediction of Protein Subcellular Localization Using Multi-Label Combinations, vol.18, pp.7, 2014, https://doi.org/10.6109/jkiice.2014.18.7.1749