http://dx.doi.org/10.5351/KJAS.2022.35.5.667

A review of gene selection methods based on machine learning approaches  

Lee, Hajoung (Department of Statistics, Sungkyunkwan University)
Kim, Jaejik (Department of Statistics, Sungkyunkwan University)
Publication Information
The Korean Journal of Applied Statistics / v.35, no.5, 2022, pp. 667-684
Abstract
Gene expression data record the mRNA abundance of each gene, and analyses of gene expression have provided key ideas for understanding disease mechanisms and for developing new drugs and therapies. High-throughput technologies such as DNA microarrays and RNA sequencing now allow the expression of thousands of genes to be measured simultaneously, which makes gene expression data high-dimensional. Because of this high dimensionality, models trained on gene expression data are prone to overfitting, and dimension reduction or feature selection is therefore commonly applied as a preprocessing step. In particular, gene selection methods applied at this stage remove irrelevant and redundant genes and identify important ones. Many gene selection methods have been developed in the context of machine learning. In this paper, we review in detail recent work on gene selection methods based on machine learning approaches. We also discuss the underlying difficulties of current gene selection methods and directions for future research.
Keywords
gene selection; gene expression data; supervised learning; unsupervised learning