http://dx.doi.org/10.7469/JKSQM.2019.47.3.537

Association-based Unsupervised Feature Selection for High-dimensional Categorical Data  

Lee, Changki (College of Business Administration, Dongguk University)
Jung, Uk (College of Business Administration, Dongguk University)
Abstract
Purpose: The development of information technology has made it easy to collect and use high-dimensional categorical data. In this regard, the purpose of this study is to propose a novel method for selecting the proper categorical variables in high-dimensional categorical data. Methods: The proposed feature selection method consists of three steps: (1) The first step defines the goodness-to-pick measure. In this paper, a categorical variable is considered relevant if it has relationships with the other variables; accordingly, the goodness-to-pick measure calculates the normalized conditional entropy of a variable given each of the other variables. (2) The second step finds the relevant feature subset of the original variable set by deciding whether each variable is relevant or not. (3) The third step eliminates redundant variables from the relevant feature subset. Results: Our experimental results showed that the proposed feature selection method generally yielded better classification performance than classification without feature selection on high-dimensional categorical data, especially as the number of irrelevant categorical variables increases. Moreover, as the number of irrelevant categorical variables with imbalanced categorical values grows, the accuracy gap between the proposed method and the existing methods compared widens. Conclusion: The experimental results confirm that the proposed method consistently produces high classification accuracy on high-dimensional categorical data. Therefore, the proposed method is promising for effective use in high-dimensional situations.
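The abstract does not give the exact formula for the goodness-to-pick measure. As a minimal sketch, assuming the measure scores a variable X by the normalized conditional entropy H(X|Y)/H(X) averaged over every other variable Y (so that a lower score means stronger association with the rest of the data), the relevance step could look like the following Python; the function names and the thresholding convention are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a sequence of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(x, y):
    """H(X | Y): uncertainty about x remaining after observing y."""
    n = len(x)
    y_counts = Counter(y)
    h = 0.0
    for (yv, xv), c in Counter(zip(y, x)).items():
        p_xy = c / n                 # joint probability p(x, y)
        p_x_given_y = c / y_counts[yv]  # conditional probability p(x | y)
        h -= p_xy * math.log2(p_x_given_y)
    return h

def relevance_score(data, j):
    """Average normalized conditional entropy of variable j given each other
    variable; lower values indicate stronger association (hypothetical form
    of the goodness-to-pick measure)."""
    x = [row[j] for row in data]
    hx = entropy(x)
    if hx == 0:
        return 1.0  # a constant variable carries no information
    scores = []
    for k in range(len(data[0])):
        if k == j:
            continue
        y = [row[k] for row in data]
        scores.append(conditional_entropy(x, y) / hx)
    return sum(scores) / len(scores)

# Usage sketch: keep variables whose score falls below an assumed cutoff.
data = [("a", "x", "p"), ("a", "x", "q"), ("b", "y", "p"), ("b", "y", "q")]
relevant = [j for j in range(len(data[0])) if relevance_score(data, j) < 0.9]
```

Under this reading, step (2) of the method reduces to thresholding these scores, and step (3) would then prune variables whose scores are explained by an already-selected variable; the cutoff value above is an arbitrary placeholder.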
Keywords
Feature Selection; High-dimensional Categorical Data; Association-based Dissimilarity; Distance Metric; Unsupervised Learning;