Ensemble Gene Selection Method Based on Multiple Tree Models

  • Mingzhu Lou (School of Information and Engineering, Nanchang Institute of Technology)
  • Received : 2022.09.23
  • Accepted : 2023.04.08
  • Published : 2023.10.31

Abstract

Identifying highly discriminating genes is a critical step in tumor recognition tasks based on microarray gene expression profile data and machine learning. Gene selection based on tree models has been the subject of several studies. However, these methods rely on a single-tree model, which is often not robust to ultra-high-dimensional microarray datasets, resulting in the loss of useful information and unsatisfactory classification accuracy. Motivated by the limitations of single-tree-based gene selection, in this study, ensemble gene selection methods based on multiple-tree models were studied to improve the classification performance of tumor identification. Specifically, we selected the three most representative tree models: ID3, random forest, and gradient boosting decision tree. Each tree model selects the top-n genes from the microarray dataset based on its intrinsic mechanism. Subsequently, three ensemble gene selection methods were investigated: multiple-tree model intersection, multiple-tree model union, and multiple-tree model cross-union. Experimental results on five benchmark public microarray gene expression datasets show that the multiple-tree model union is significantly superior, in classification accuracy, to gene selection based on a single tree model and to other competitive gene selection methods.
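The ensemble scheme described above can be sketched in a few lines: each tree model ranks genes by its own importance measure, keeps the top-n, and the per-model gene sets are then combined by set union or intersection. The sketch below is an assumption-laden illustration, not the authors' implementation: it uses scikit-learn, substitutes an entropy-criterion CART tree for ID3 (scikit-learn does not ship ID3), and runs on synthetic data in place of a real microarray matrix.

```python
# Hedged sketch of multiple-tree model gene selection by union/intersection.
# Assumptions: scikit-learn models stand in for the paper's ID3/RF/GBDT;
# make_classification stands in for a microarray expression matrix.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

def top_n_genes(model, X, y, n):
    """Fit one tree model and return the indices of its n most important features."""
    model.fit(X, y)
    return set(np.argsort(model.feature_importances_)[::-1][:n])

# Synthetic stand-in: 60 samples x 200 "genes", 10 of them informative.
X, y = make_classification(n_samples=60, n_features=200, n_informative=10,
                           random_state=0)
n = 20
models = [
    # scikit-learn implements CART; the entropy criterion approximates ID3's gain.
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    RandomForestClassifier(n_estimators=100, random_state=0),
    GradientBoostingClassifier(n_estimators=100, random_state=0),
]
selections = [top_n_genes(m, X, y, n) for m in models]

union = set.union(*selections)                 # multiple-tree model union
intersection = set.intersection(*selections)   # multiple-tree model intersection
print(len(union), len(intersection))
```

The union keeps every gene that any model finds important (between n and 3n genes), while the intersection keeps only genes all three models agree on; the paper's finding is that the union variant gives the best downstream classification accuracy.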

Acknowledgement

This work was partially supported by funds from the Jiangxi Education Department, PR China (No. GJJ211919), and by a grant from the National Natural Science Foundation of China (No. 62166028).

References

  1. H. Rao, X. Shi, A. K. Rodrigue, J. Feng, Y. Xia, M. Elhoseny, X. Yuan, and L. Gu, "Feature selection based on artificial bee colony and gradient boosting decision tree," Applied Soft Computing, vol. 74, pp. 634-642, 2019. https://doi.org/10.1016/j.asoc.2018.10.036
  2. J. T. Horng, L. C. Wu, B. J. Liu, J. L. Kuo, W. H. Kuo, and J. J. Zhang, "An expert system to classify microarray gene expression data using gene selection by decision tree," Expert Systems with Applications, vol. 36, no. 5, pp. 9072-9081, 2009. https://doi.org/10.1016/j.eswa.2008.12.037
  3. W. Xiong and C. Wang, "A hybrid improved ant colony optimization and random forests feature selection method for microarray data," in Proceedings of 2009 5th International Joint Conference on INC, IMS and IDC, Seoul, South Korea, 2009, pp. 559-563. https://doi.org/10.1109/NCM.2009.66
  4. G. Dagnew and B. H. Shekar, "Ensemble learning-based classification of microarray cancer data on tree-based features," Cognitive Computation and Systems, vol. 3, no. 1, pp. 48-60, 2021. https://doi.org/10.1049/ccs2.12003
  5. X. Deng, M. Li, S. Deng, and L. Wang, "Hybrid gene selection approach using XGBoost and multi-objective genetic algorithm for cancer classification," Medical & Biological Engineering & Computing, vol. 60, no. 3, pp. 663-681, 2022. https://doi.org/10.1007/s11517-021-02476-x
  6. T. Chen and C. Guestrin, "XGBoost: a scalable tree boosting system," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, 2016, pp. 785-794. https://doi.org/10.1145/2939672.2939785
  7. J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, pp. 81-106, 1986. https://doi.org/10.1023/A:1022643204877
  8. L. Breiman, "Random forests," Machine Learning, vol. 45, pp. 5-32, 2001. https://doi.org/10.1023/A:1010933404324
  9. J. H. Friedman, "Stochastic gradient boosting," Computational Statistics & Data Analysis, vol. 38, no. 4, pp. 367-378, 2002. https://doi.org/10.1016/S0167-9473(01)00065-2
  10. C. C. Chang and C. J. Lin, "LIBSVM: a library for support vector machines," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, article no. 27, 2011. https://doi.org/10.1145/1961189.1961199
  11. G. H. John and P. Langley, "Estimating continuous distributions in Bayesian classifiers," in Proceedings of the 11th Annual Conference on Uncertainty in Artificial Intelligence, Montreal, Canada, 1995, pp. 338-345.
  12. J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann Publishers, 2014.
  13. A. D. Gordon, "Review of Classification and Regression Trees by L. Breiman, J. Friedman, R. A. Olshen, and C. J. Stone, editors," Biometrics, vol. 40, no. 3, p. 874, 1984. https://doi.org/10.2307/2530946
  14. J. H. Friedman, "Greedy function approximation: a gradient boosting machine," Annals of Statistics, vol. 29, no. 5, pp. 1189-1232, 2001. https://doi.org/10.1214/aos/1013203451
  15. F. Borovecki, L. Lovrecic, J. Zhou, H. Jeong, F. Then, H. D. Rosas, et al., "Genome-wide expression profiling of human blood reveals biomarkers for Huntington's disease," Proceedings of the National Academy of Sciences, vol. 102, no. 31, pp. 11023-11028, 2005. https://doi.org/10.1073/pnas.0504921102
  16. A. A. Alizadeh, M. B. Eisen, R. E. Davis, C. Ma, I. S. Lossos, A. Rosenwald, et al., "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling," Nature, vol. 403, no. 6769, pp. 503-511, 2000. https://doi.org/10.1038/35000501
  17. T. Li, C. Zhang, and M. Ogihara, "A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression," Bioinformatics, vol. 20, no. 15, pp. 2429-2437, 2004. https://doi.org/10.1093/bioinformatics/bth267
  18. A. Bhattacharjee, W. G. Richards, J. Staunton, C. Li, S. Monti, P. Vasa, et al., "Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses," Proceedings of the National Academy of Sciences, vol. 98, no. 24, pp. 13790-13795, 2001. https://doi.org/10.1073/pnas.191502998
  19. Z. Zhu, Y. S. Ong, and M. Dash, "Markov blanket-embedded genetic algorithm for gene selection," Pattern Recognition, vol. 40, no. 11, pp. 3236-3248, 2007. https://doi.org/10.1016/j.patcog.2007.02.007
  20. Q. Al-Tashi, S. J. A. Kadir, H. M. Rais, S. Mirjalili, and H. Alhussian, "Binary optimization using hybrid grey wolf optimization for feature selection," IEEE Access, vol. 7, pp. 39496-39508, 2019. https://doi.org/10.1109/ACCESS.2019.2906757
  21. X. Huang, L. Zhang, B. Wang, F. Li, and Z. Zhang, "Feature clustering based support vector machine recursive feature elimination for gene selection," Applied Intelligence, vol. 48, pp. 594-607, 2018. https://doi.org/10.1007/s10489-017-0992-2
  22. Z. Hou and S. Y. Kung, "A kernel discriminant information approach to non-linear feature selection," in Proceedings of 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 2019, pp. 1-10. https://doi.org/10.1109/IJCNN.2019.8852186
  23. R. J. Urbanowicz, R. S. Olson, P. Schmitt, M. Meeker, and J. H. Moore, "Benchmarking relief-based feature selection methods for bioinformatics data mining," Journal of Biomedical Informatics, vol. 85, pp. 168-188, 2018. https://doi.org/10.1016/j.jbi.2018.07.015