http://dx.doi.org/10.7737/JKORMS.2014.39.4.075

Set Covering-based Feature Selection of Large-scale Omics Data  

Ma, Zhengyu (School of Industrial Management Engineering, Korea University)
Yan, Kedong (School of Information Management Engineering, Korea University)
Kim, Kwangsoo (Bioinformatics Institute, Seoul National University)
Ryoo, Hong Seo (School of Industrial Management Engineering, Korea University)
Abstract
In this paper, we address the feature selection problem for large-scale, high-dimensional biological data such as omics data. Most previous approaches use a simple score function to reduce the number of original variables and then select features from the small number of remaining variables. Methods that do not rely on filtering either ignore the interactions between variables or generate approximate solutions to a simplified problem. In contrast, by combining set covering and clustering techniques, we developed a new method that handles the full set of variables and considers the combinatorial effects of variables when selecting good features. To demonstrate the efficacy and effectiveness of the method, we downloaded gene expression datasets from TCGA (The Cancer Genome Atlas) and compared our method with other algorithms, including the feature selection algorithms embedded in WEKA. The experimental results show that our method selects high-quality features that yield more accurate classifiers than those produced by other feature selection algorithms.
Keywords
Bioinformatics; Feature Selection; Set Covering Problem; Omics Data
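The abstract does not spell out the formulation, so the following is only a minimal sketch of how feature selection can be cast as a set covering problem, not the authors' method. It assumes binarized features, treats every pair of samples from different classes as an element to be covered, says a feature covers a pair when the two samples differ on it, and uses a generic greedy heuristic instead of the clustering-based procedure the paper develops. The function name `greedy_set_cover_features` and the toy data are hypothetical.

```python
def greedy_set_cover_features(X, y):
    """Greedy set-cover feature selection (illustrative sketch only).

    X: list of samples, each a list of 0/1 feature values (assumed binarized).
    y: list of class labels, one per sample.
    Returns (selected feature indices, indices of pairs left uncovered).
    """
    n_samples, n_features = len(X), len(X[0])

    # Elements to cover: every pair of samples with different class labels.
    pairs = [(i, j)
             for i in range(n_samples)
             for j in range(i + 1, n_samples)
             if y[i] != y[j]]

    # covers[f] holds the indices of the pairs that feature f separates.
    covers = [{p for p, (i, j) in enumerate(pairs) if X[i][f] != X[j][f]}
              for f in range(n_features)]

    uncovered = set(range(len(pairs)))
    selected = []
    while uncovered:
        # Pick the feature that separates the most still-uncovered pairs.
        best = max(range(n_features), key=lambda f: len(covers[f] & uncovered))
        gain = covers[best] & uncovered
        if not gain:
            break  # the remaining pairs are identical on every feature
        selected.append(best)
        uncovered -= gain
    return selected, uncovered


# Toy XOR-like data: neither feature separates the classes on its own,
# so the cover needs both of them.
X_toy = [[0, 0], [1, 1], [0, 1], [1, 0]]
y_toy = [0, 0, 1, 1]
selected, uncovered = greedy_set_cover_features(X_toy, y_toy)
print(sorted(selected), len(uncovered))
```

In the toy data each feature separates only half of the cross-class pairs, so a univariate score would rank both as equally weak, while the covering view selects them jointly. This is one way to read the abstract's point about combinatorial effects of variables.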