Diversity-based Ensemble Genetic Programming for Improving Classification Performance

분류 성능 향상을 위한 다양성 기반 앙상블 유전자 프로그래밍

  • Jin-Hyuk Hong (Department of Computer Science, Yonsei University)
  • Sung-Bae Cho (Department of Computer Science, Yonsei University)
  • Published: 2005.12.01

Abstract

Combining multiple classifiers has been actively exploited to improve classification performance. A good ensemble classifier requires a pool of accurate and diverse base classifiers. Conventionally, ensemble learning techniques such as bagging and boosting have been used, or the diversity of base classifiers has been estimated on the training set, but these approaches are limited when classifying gene expression profiles because only a few training samples are available. This paper proposes an ensemble technique that analyzes the diversity of classification rules obtained by genetic programming. Genetic programming generates interpretable rules, and a sample is classified by combining the most diverse set of rules. We have applied the proposed method to cancer classification with gene expression profiles. Experiments on a lymphoma cancer dataset, a lung cancer dataset, and an ovarian cancer dataset have illustrated the usefulness of the proposed method. A higher classification accuracy has been obtained with the proposed method than without considering diversity. It has also been confirmed that classification performance increases with diversity.

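Research on combining multiple classifiers to improve classification performance is being actively pursued. To obtain a good ensemble classifier, accurate and diverse individual classifiers must be constructed. Conventionally, ensemble learning techniques such as bagging and boosting have been used, or the diversity of the obtained individual classifiers has been measured on the training data, but these approaches are limited when training data are scarce, as with gene expression data. This paper proposes an ensemble technique that analyzes the structural diversity of rules obtained by genetic programming and combines them accordingly. Genetic programming generates interpretable classification rules, the diversity among them is measured, and a diverse set of these rules is then combined to perform classification. The usefulness of the proposed method was verified through experiments on classifying lymphoma, lung cancer, and ovarian cancer from gene expression data. Combining rules after analyzing the diversity among them yielded higher classification performance than ignoring diversity, and the correct classification rate was also confirmed to increase with the diversity among the individual classification rules.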
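
The idea of combining the most diverse set of rules can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the method of the paper: the abstract does not specify the diversity measure, so the sketch uses the Jaccard distance between the sets of genes each rule refers to as a stand-in for structural diversity, selects the subset by exhaustive search, and combines binary rule outputs by majority vote. The names `jaccard_distance`, `most_diverse_subset`, and `majority_vote`, the gene ids, and the toy rules are all hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Distance between two rules, each described by the set of gene ids it uses.

    Assumption: Jaccard distance stands in for the paper's structural measure.
    """
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def subset_diversity(rule_genes, subset):
    """Mean pairwise distance among the rules indexed by `subset`."""
    pairs = list(combinations(subset, 2))
    total = sum(jaccard_distance(rule_genes[i], rule_genes[j]) for i, j in pairs)
    return total / len(pairs)


def most_diverse_subset(rule_genes, k):
    """Exhaustively pick the k rules (k >= 2) with the highest mean pairwise distance."""
    if k < 2:
        raise ValueError("k must be at least 2 to measure pairwise diversity")
    candidates = combinations(range(len(rule_genes)), k)
    return max(candidates, key=lambda s: subset_diversity(rule_genes, s))


def majority_vote(rules, subset, sample):
    """Combine the selected rules by majority vote; each rule returns 0 or 1."""
    votes = sum(rules[i](sample) for i in subset)
    return 1 if 2 * votes > len(subset) else 0


# Hypothetical arithmetic rules over gene expression values.
rules = [
    lambda s: int(s["g1"] - s["g2"] > 0),
    lambda s: int(s["g1"] + s["g3"] > 1.0),
    lambda s: int(s["g4"] > 0.5),
    lambda s: int(s["g5"] - s["g6"] > 0),
]
# The genes each rule refers to, in the same order as `rules`.
rule_genes = [{"g1", "g2"}, {"g1", "g3"}, {"g4"}, {"g5", "g6"}]

sample = {"g1": 0.9, "g2": 0.2, "g3": 0.4, "g4": 0.1, "g5": 0.7, "g6": 0.3}
selected = most_diverse_subset(rule_genes, 3)
prediction = majority_vote(rules, selected, sample)
```

In this toy setting the first two rules share gene `g1`, so any subset containing both scores lower than a subset of rules with no genes in common. Exhaustive search is only practical for small rule pools; a greedy selection would be a natural substitute for larger ones.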
