http://dx.doi.org/10.5916/jkosme.2016.40.5.437

A comparative study of filter methods based on information entropy  

Kim, Jung-Tae (Department of Data Information, Korea Maritime and Ocean University)
Kum, Ho-Yeun (Department of Data Information, Korea Maritime and Ocean University)
Kim, Jae-Hwan (Department of Data Information, Korea Maritime and Ocean University)
Abstract
Feature selection has become an essential technique for reducing the dimensionality of data sets, in which many features are frequently irrelevant or redundant for the classification task. The purpose of feature selection is to select the relevant features and remove the irrelevant and redundant ones. Applications of feature selection range from text processing, face recognition, bioinformatics, speaker verification, and medical diagnosis to financial domains. In this study, we focus on filter methods based on information entropy: IG (Information Gain), FCBF (Fast Correlation Based Filter), and mRMR (minimum Redundancy Maximum Relevance). FCBF has the advantage of reducing the computational burden by eliminating redundant features that satisfy the condition of an approximate Markov blanket. However, FCBF considers only the relevance between each feature and the class when selecting the best features, and thus fails to take the interaction between features into consideration. In this paper, we propose an improved FCBF to overcome this shortcoming, and we perform a comparative study to evaluate the performance of the proposed method.
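The entropy-based scores the abstract refers to can be sketched in a few lines. The sketch below (function names are illustrative, and discrete-valued features are assumed) computes Shannon entropy H(X), information gain IG(X; C) = H(X) - H(X | C), and the symmetric uncertainty SU(X, C) = 2 * IG / (H(X) + H(C)) that FCBF uses to rank features against the class:

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy H(X) of a discrete variable, in bits."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def conditional_entropy(x, y):
    """H(X | Y): entropy of x within each group defined by y, weighted by group size."""
    n = len(y)
    h = 0.0
    for y_val, cnt in Counter(y).items():
        subset = [xi for xi, yi in zip(x, y) if yi == y_val]
        h += (cnt / n) * entropy(subset)
    return h

def information_gain(x, y):
    """IG(X; Y) = H(X) - H(X | Y), i.e., the mutual information of x and y."""
    return entropy(x) - conditional_entropy(x, y)

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * IG / (H(X) + H(Y)); normalized to [0, 1], used by FCBF."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    return 2.0 * information_gain(x, y) / (hx + hy)

# A feature identical to the class is maximally relevant (SU = 1);
# a feature independent of the class carries no information (SU = 0).
cls = [0, 0, 1, 1]
print(symmetric_uncertainty([0, 0, 1, 1], cls))  # 1.0
print(symmetric_uncertainty([0, 1, 0, 1], cls))  # 0.0
```

In FCBF, a candidate feature F_i is then discarded whenever some already-selected feature F_j satisfies SU(F_j, F_i) >= SU(F_i, C); that is, F_j forms an approximate Markov blanket for F_i, which is what makes F_i redundant.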
Keywords
Metaheuristics; Improved tabu search; Subset selection problem;
Citations & Related Records
Times Cited By KSCI: 1
1 M. Hall, "Correlation-based feature selection for machine learning," PhD thesis, The University of Waikato, 1999.
2 Z. Zhao, H. Liu, "Searching for interacting features," International Joint Conference on Artificial Intelligence, vol. 7, pp. 1156-1161, 2007.
3 I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, pp. 389-422, 2002.
4 S. Maldonado, R. Weber, and J. Basak, "Simultaneous feature selection and classification using kernel-penalized support vector machines," Information Sciences, vol. 181, no. 1, pp. 115-128, 2011.
5 J. G. Bae, J. T. Kim, and J. H. Kim, "Subset selection in multiple linear regression: an improved tabu search," Journal of Korean Society of Marine Engineering, vol. 40, no. 2, pp. 138-145, 2016.
6 I. Inza, B. Sierra, R. Blanco, and P. Larranaga, "Gene selection by sequential search wrapper approaches in microarray cancer class prediction," Journal of Intelligent and Fuzzy Systems, vol. 12, no. 1, pp. 25-33, 2002.
7 R. Ruiz, J. Riquelme, and J. Aguilar-Ruiz, "Incremental wrapper-based gene selection from microarray data for cancer classification," Pattern Recognition, vol. 39, no. 12, pp. 2383-2392, 2006.
8 S. Shreem, S. Abdullah, M. Nazri, and M. Alzaqebah, "Hybridizing ReliefF, mRMR filters and GA wrapper approaches for gene selection," Journal of Theoretical and Applied Information Technology, vol. 46, no. 2, pp. 1034-1039, 2012.
9 L. Chuang, C. Yang, K. Wu, and C. Yang, "A hybrid feature selection method for DNA microarray data," Computers in Biology and Medicine, vol. 41, no. 4, pp. 228-237, 2011.
10 W. Aiguo, A. Ning, C. Guilin, and L. Lian, "Hybridizing mRMR and harmony search for gene selection and classification of microarray data," Journal of Computational Information Systems, vol. 11, no. 5, pp. 1563-1570, 2015.
11 V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273-297, 1995.
12 J. Demsar, B. Zupan, M. W. Kattan, J. R. Beck, and I. Bratko, "Naive bayesian-based nomogram for prediction of prostate cancer recurrence," Studies in Health Technology and Informatics, vol. 68, pp. 436-441, 1999.
13 H. Sun, "A naive Bayes classifier for prediction of multidrug resistance reversal activity on the basis of atom typing," Journal of Medicinal Chemistry, vol. 48, no. 12, pp. 4031-4039, 2005.
14 T. M. Cover and P. E. Hart, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21-27, 1967.
15 J. N. Morgan and J. A. Sonquist, "Problems in the analysis of survey data, and a proposal," Journal of the American Statistical Association, vol. 58, no. 302, pp. 415-434, 1963.
16 J. A. Hartigan, Clustering Algorithms, New York: Wiley, 1975.
17 L. E. Raileanu and K. Stoffel, "Theoretical comparison between the Gini Index and information gain criteria," Annals of Mathematics and Artificial Intelligence, vol. 41, no. 1, pp. 77-93, 2004.
18 Q. Gu, Z. Li, and J. Han, "Generalized fisher score for feature selection," Proceedings of the International Conference on Uncertainty in Artificial Intelligence, 2011.
19 M. Hall and L. Smith, "Practical feature subset selection for machine learning," Computer Science, vol. 98, pp. 181-191, 1998.
20 J. Yang, Y. Liu, Z. Liu, X. Zhu, and X. Zhang, "A new feature selection algorithm based on binomial hypothesis testing for spam filtering," Knowledge-Based Systems, vol. 24, no. 6, pp. 904-914, 2011.
21 X. He, D. Cai, and P. Niyogi, "Laplacian score for feature selection," Advances in Neural Information Processing Systems, pp. 507-514, 2005.
22 K. Kira and L. Rendell, "The feature selection problem: traditional methods and a new algorithm," Proceedings of the Tenth National Conference on Artificial intelligence, AAAI Press, San Jose, CA, vol. 2, pp. 129-134. 1992.
23 L. Yu and H. Liu, "Feature selection for high-dimensional data: a fast correlation-based filter solution," Proceedings of the Twentieth International Conference on Machine Learning, vol. 3, pp. 856-863, 2003.
24 H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, 2005.
25 J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81-106, 1986.
26 C. Ambroise and G. McLachlan, "Selection bias in gene extraction on the basis of microarray gene-expression data," Proceedings of the National Academy of Sciences, vol. 99, no. 10, pp. 6562-6566, 2002.
27 L. J. van 't Veer et al., "Gene expression profiling predicts clinical outcome of breast cancer," Nature, vol. 415, no. 6871, pp. 530-536, 2002.
28 A. A. Alizadeh et al., "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling," Nature, vol. 403, no. 6769, pp. 503-511, 2000.
29 U. Scherf et al., "A cDNA microarray gene expression database for the molecular pharmacology of cancer," Nature Genetics, vol. 24, no. 3, pp. 236-244, 2000.