http://dx.doi.org/10.7469/JKSQM.2019.47.3.537

Association-based Unsupervised Feature Selection for High-dimensional Categorical Data  

Lee, Changki (College of Business Administration, Dongguk University)
Jung, Uk (College of Business Administration, Dongguk University)
Abstract
Purpose: The development of information technology has made it easy to collect and use high-dimensional categorical data. In this regard, the purpose of this study is to propose a novel method for selecting the proper categorical variables in high-dimensional categorical data. Methods: The proposed feature selection method consists of three steps: (1) The first step defines the goodness-to-pick measure. In this paper, a categorical variable is considered relevant if it has relationships with the other variables; accordingly, the goodness-to-pick measure calculates the normalized conditional entropy of a variable given each of the other variables. (2) The second step finds the relevant feature subset of the original variable set by deciding whether each variable is relevant or not. (3) The third step eliminates redundant variables from the relevant feature subset. Results: Our experimental results showed that the proposed feature selection method generally yielded better classification performance than classification without feature selection on high-dimensional categorical data, especially as the number of irrelevant categorical variables increases. Moreover, as the number of irrelevant categorical variables with imbalanced categorical values grows, the accuracy gap between the proposed method and the existing methods compared widens. Conclusion: The experimental results confirm that the proposed method consistently produces high classification accuracy on high-dimensional categorical data. Therefore, the proposed method is promising for effective use in high-dimensional situations.
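The abstract does not give the exact formula for the goodness-to-pick measure. As a minimal sketch, assuming the measure scores a variable X by the normalized conditional entropy H(X|Y)/H(X) averaged over every other variable Y (so that a lower score means stronger association with the rest of the data), the relevance step could look like the following Python; the function names and the thresholding convention are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a sequence of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(x, y):
    """H(X | Y): uncertainty about x remaining after observing y."""
    n = len(x)
    y_counts = Counter(y)
    h = 0.0
    for (yv, xv), c in Counter(zip(y, x)).items():
        p_xy = c / n                 # joint probability p(x, y)
        p_x_given_y = c / y_counts[yv]  # conditional probability p(x | y)
        h -= p_xy * math.log2(p_x_given_y)
    return h

def relevance_score(data, j):
    """Average normalized conditional entropy of variable j given each other
    variable; lower values indicate stronger association (hypothetical form
    of the goodness-to-pick measure)."""
    x = [row[j] for row in data]
    hx = entropy(x)
    if hx == 0:
        return 1.0  # a constant variable carries no information
    scores = []
    for k in range(len(data[0])):
        if k == j:
            continue
        y = [row[k] for row in data]
        scores.append(conditional_entropy(x, y) / hx)
    return sum(scores) / len(scores)

# Usage sketch: keep variables whose score falls below an assumed cutoff.
data = [("a", "x", "p"), ("a", "x", "q"), ("b", "y", "p"), ("b", "y", "q")]
relevant = [j for j in range(len(data[0])) if relevance_score(data, j) < 0.9]
```

Under this reading, step (2) of the method reduces to thresholding these scores, and step (3) would then prune variables whose scores are explained by an already-selected variable; the cutoff value above is an arbitrary placeholder.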
Keywords
Feature Selection; High-dimensional Categorical Data; Association-based Dissimilarity; Distance Metric; Unsupervised Learning;