http://dx.doi.org/10.7469/JKSQM.2019.47.2.271

Categorical Data Clustering Analysis Using Association-based Dissimilarity  

Lee, Changki (College of Business Administration, Dongguk University)
Jung, Uk (College of Business Administration, Dongguk University)
Abstract
Purpose: This study proposes a more efficient distance measure for clustering categorical data, one that takes the relationships between categorical variables into account. Methods: An association-based dissimilarity was used to calculate the distance between two categorical observations, and the resulting distances were supplied to the PAM (partitioning around medoids) clustering algorithm to verify their effectiveness. The dissimilarity between two values of a categorical variable is calculated as a mixture of the dissimilarities between the conditional probability distributions of the other categorical variables given each of these two values. The method is particularly suited to datasets whose categorical variables are highly correlated. Results: Simulation results on several real-life datasets showed that the proposed distance, which considers the relationships among the categorical variables, generally yielded better clustering performance than the Hamming distance. In addition, as the number of correlated variables increased, the difference in performance between the two clustering methods based on the different distance measures became more statistically significant. Conclusion: Accounting for the relationships between categorical variables with the proposed method positively affected the results of cluster analysis.
Keywords
Association-based Dissimilarity; Distance Metric; Unsupervised Learning; Categorical Data; Clustering
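The idea in the Methods part of the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names and the toy data are hypothetical, total variation distance is an assumed stand-in for the divergence between conditional distributions, and the "mixture" is assumed to be a plain average over the other variables. The observations being compared are assumed to come from the dataset itself.

```python
from collections import Counter, defaultdict


def hamming(x, y):
    """Number of attributes on which two observations differ."""
    return sum(a != b for a, b in zip(x, y))


def conditional_distributions(data, i, j):
    """Estimate P(A_j | A_i = v) for every observed value v of attribute i."""
    counts = defaultdict(Counter)
    for row in data:
        counts[row[i]][row[j]] += 1
    return {
        v: {w: n / sum(c.values()) for w, n in c.items()}
        for v, c in counts.items()
    }


def total_variation(p, q):
    """Half the L1 distance between two discrete distributions.

    Assumed choice for this sketch; other divergences could be used.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in support)


def value_dissimilarity(data, i, a, b):
    """Dissimilarity between values a and b of attribute i.

    Average (assumed mixture) over all other attributes j of the distance
    between P(A_j | A_i = a) and P(A_j | A_i = b).
    """
    if a == b:
        return 0.0
    others = [j for j in range(len(data[0])) if j != i]
    total = 0.0
    for j in others:
        cond = conditional_distributions(data, i, j)
        total += total_variation(cond[a], cond[b])
    return total / len(others)


def association_distance(data, x, y):
    """Sum of per-attribute value dissimilarities between two observations."""
    return sum(
        value_dissimilarity(data, i, a, b)
        for i, (a, b) in enumerate(zip(x, y))
    )


# Hypothetical toy data: (color, shape, taste)
data = [
    ("red", "round", "sweet"),
    ("red", "round", "sweet"),
    ("pink", "round", "sweet"),
    ("pink", "round", "sweet"),
    ("green", "long", "sour"),
    ("green", "long", "sour"),
    ("green", "round", "sour"),
]

for x, y in [(data[0], data[2]), (data[4], data[6])]:
    print(x, y, hamming(x, y), association_distance(data, x, y))
```

In this toy data both pairs differ in exactly one attribute, so the Hamming distance cannot tell them apart. The association-based sketch assigns a smaller distance to the red/pink pair, because "red" and "pink" co-occur with the same shapes and tastes, whereas "long" and "round" do not share the same context. A pairwise matrix built from `association_distance` could then be passed to any medoid-based clustering routine that accepts precomputed distances.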