http://dx.doi.org/10.5351/CKSS.2005.12.3.659

Discretization Method Based on Quantiles for Variable Selection Using Mutual Information  

Cha, Woon-Ock (Division of Computer Engineering, Hansung University)
Huh, Moon-Yul (Department of Statistics, Sungkyunkwan University)
Publication Information
Communications for Statistical Applications and Methods, v.12, no.3, 2005, pp. 659-672
Abstract
This paper evaluates the discretization of continuous variables for selecting relevant variables in supervised learning using mutual information. Three discretization methods are considered: MDL, Histogram, and 4-Intervals. The combined process of discretization and variable subset selection is evaluated by classification accuracy on six real data sets from the UCI repository. The results show that the 4-Intervals method, which discretizes at the quantiles, is robust and efficient for the variable selection process. We also evaluate the appropriateness of the selected variable subsets visually.
Keywords
variable selection; discretization; mutual information; data visualization;
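
The abstract's two building blocks, quantile-based discretization and mutual-information scoring, can be illustrated with a short sketch. The paper itself works with DAVIS and does not publish code, so the Python below is only a minimal illustration under our own assumptions; the names discretize_quantiles and mutual_information are hypothetical, not from the paper.

```python
import numpy as np

def discretize_quantiles(x, k=4):
    """Cut a continuous variable into k intervals at its sample quantiles.
    k=4 corresponds to the paper's 4-Intervals method (cuts at the quartiles).
    NOTE: illustrative sketch only, not the authors' implementation."""
    # interior cut points: the 25%, 50%, 75% quantiles when k = 4
    cuts = np.quantile(x, np.linspace(0, 1, k + 1)[1:-1])
    return np.searchsorted(cuts, x)  # interval label 0 .. k-1 per observation

def mutual_information(x_disc, y):
    """Empirical mutual information I(X; Y) in nats between a discretized
    variable and the class label, from the joint frequency table."""
    mi = 0.0
    for xv in np.unique(x_disc):
        p_x = np.mean(x_disc == xv)
        for yv in np.unique(y):
            p_y = np.mean(y == yv)
            p_xy = np.mean((x_disc == xv) & (y == yv))
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

# usage: rank the columns of X by I(X_j; class); variable 2 should rank first
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 2] > 0).astype(int)  # class determined by variable 2 only
scores = [mutual_information(discretize_quantiles(X[:, j]), y)
          for j in range(X.shape[1])]
print(np.argsort(scores)[::-1])
```

Quantile cuts yield equal-frequency intervals, which is one plausible reason the abstract reports the method as robust: unlike the equal-width cells of a Histogram discretization, the cut points adapt to skewed marginal distributions.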
Citations & Related Records
Times Cited By KSCI: 1
1 Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory, Wiley, New York
2 Dash, M. and Liu, H. (1997). Feature selection for classification, Intelligent Data Analysis, Elsevier Science Inc.
3 Devijver, P. A. and Kittler, J. (1982). Pattern Recognition: A Statistical Approach, Prentice Hall International
4 Fayyad, U. M. and Irani, K. B. (1992). On the Handling of Continuous-valued Attributes in Decision Tree Generation, Machine Learning, Vol. 8, 87-102
5 Huh, M. Y. (2005). DAVIS (http://stat.skku.ac.kr/myhuh/DAVIS.html)
6 Ihaka, R. and Gentleman, R. (1996). R: A language for data analysis and graphics, Journal of Computational and Graphical Statistics, 5(3), 299-314 (http://www.r-project.org)
7 Liu, H. and Motoda, H. (1998). Feature Selection for Knowledge Discovery and Data Mining, Kluwer Academic Publishers
8 Merz, C. J. and Murphy, P. M. (1996). UCI Repository of Machine Learning Databases, Department of Information and Computer Science, University of California, Irvine, CA (http://www.ics.uci.edu/~mlearn/MLRepository.html)
9 Parzen, E. (1962). On the estimation of a probability density function and mode, Annals of Mathematical Statistics, 33(3), 1065-1076
10 Battiti, R. (1994). Using mutual information for selecting features in supervised neural net learning, IEEE Transactions on Neural Networks, 5, 537-550
11 Venables, W. N. and Ripley, B. D. (1994). Modern Applied Statistics with S-Plus, Springer, New York
12 Witten, I. and Frank, E. (1999). Data Mining, Morgan Kaufmann (http://www.cs.waikato.ac.nz/ml/weka)
13 Cha, W. and Huh, M. (2003). Evaluation of Attribute Selection Methods and Prior Discretization in Supervised Learning, The Korean Communications in Statistics, Vol. 10, No. 3, 879-894
14 Tourassi, G. D., Frederick, E. D., Markey, M. K. and Floyd, C. E., Jr. (2001). Application of the mutual information criterion for feature selection in computer-aided diagnosis, Medical Physics, 28(12), 2394-2402
15 Bonnlander, B. V. and Weigend, A. S. (1994). Selecting Input Variables Using Mutual Information and Nonparametric Density Estimation, Proceedings of the International Symposium on Artificial Neural Networks (ISANN), Taiwan, 42-50
16 Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). Classification and Regression Trees, Wadsworth, Belmont, CA