http://dx.doi.org/10.5351/KJAS.2015.28.3.407

Categorical Variable Selection in Naïve Bayes Classification  

Kim, Min-Sun (Department of Statistics, University of Seoul)
Choi, Hosik (Department of Applied and Informational Statistics, Kyonggi University)
Park, Changyi (Department of Statistics, University of Seoul)
Publication Information
The Korean Journal of Applied Statistics / v.28, no.3, 2015, pp. 407-415
Abstract
Naïve Bayes classification assumes that the input variables are conditionally independent given the output variable. The Naïve Bayes assumption is unrealistic, but it simplifies high-dimensional joint probability estimation into a series of univariate probability estimations. Thus the Naïve Bayes classifier is often adopted in the analysis of massive data sets, such as spam e-mail filtering and recommendation systems. In this paper, we propose a variable selection method based on the χ² statistic computed between each input variable and the output variable. The proposed method retains the simplicity of the Naïve Bayes classifier in terms of data processing and computation, yet it can select relevant variables. We expect our method to be useful in classification problems for ultra-high dimensional or big data, such as the classification of diseases based on single nucleotide polymorphisms (SNPs).
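A minimal sketch of the general idea described above, not the authors' implementation: screen each categorical input with a χ² test of independence against the output, then fit a naïve Bayes classifier on the retained variables. The simulated data, column names, and the 0.05 significance threshold are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.naive_bayes import CategoricalNB

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)                 # binary output
x1 = (y + rng.integers(0, 2, size=n)) % 3      # input dependent on y
x2 = rng.integers(0, 3, size=n)                # irrelevant input
X = pd.DataFrame({"x1": x1, "x2": x2})

# Chi-squared test of independence between each input and the output.
selected = []
for col in X.columns:
    table = pd.crosstab(X[col], y)             # contingency table
    stat, pval, dof, expected = chi2_contingency(table)
    if pval < 0.05:                            # keep inputs associated with y
        selected.append(col)

# Naive Bayes fitted only on the selected categorical variables.
clf = CategoricalNB().fit(X[selected], y)
print(selected, clf.score(X[selected], y))
```

Because each variable is screened marginally, the computational cost grows linearly in the number of inputs, which is what makes this kind of screening attractive for ultra-high dimensional data such as SNPs.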
Keywords
big data; χ² statistic; Naïve Bayes assumption; SNP