Categorical Data Clustering Analysis Using Association-based Dissimilarity

  • Lee, Changki (College of Business Administration, Dongguk University) ;
  • Jung, Uk (College of Business Administration, Dongguk University)
  • Received : 2019.03.10
  • Accepted : 2019.03.25
  • Published : 2019.06.30

Abstract

Purpose: This study proposes a more efficient distance measure for cluster analysis of categorical data, one that takes the relationships between categorical variables into account.

Methods: Association-based dissimilarity was used to calculate the distance between two categorical observations, and the resulting distances were supplied to the PAM (Partitioning Around Medoids) clustering algorithm to verify their effectiveness. The strength of association between two categorical values is calculated from a mixture of dissimilarities between the conditional probability distributions of the other categorical variables, given each of these two values. The method is particularly suitable for datasets whose categorical variables are highly correlated.

Results: Simulations on several real-life datasets showed that the proposed distance, which considers relationships among the categorical variables, generally yielded better clustering performance than the Hamming distance. In addition, as the number of correlated variables increased, the performance difference between the two distance measures became more statistically significant.

Conclusion: Incorporating the relationships between categorical variables through the proposed method positively affected the results of cluster analysis.
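The contrast between the Hamming distance and an association-based dissimilarity can be illustrated with a minimal Python sketch. It is not the authors' implementation. It rests on several assumptions: the Jensen-Shannon divergence is used to compare conditional distributions, the per-variable divergences are averaged, and the per-attribute dissimilarities are summed to give a distance between observations. All function names and the toy dataset are hypothetical.

```python
from collections import Counter
from math import log2


def hamming(x, y):
    """Number of attributes on which two observations differ."""
    return sum(a != b for a, b in zip(x, y))


def conditional_dist(data, j, value, k):
    """Empirical distribution of attribute k over rows where attribute j == value."""
    counts = Counter(row[k] for row in data if row[j] == value)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (assumed choice), bounded in [0, 1]."""
    m = {v: 0.5 * (p.get(v, 0.0) + q.get(v, 0.0)) for v in set(p) | set(q)}

    def kl(a):
        return sum(pa * log2(pa / m[v]) for v, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def value_dissimilarity(data, j, a, b):
    """Dissimilarity between values a and b of attribute j.

    Averages (assumption) the divergence between the conditional
    distributions of every other attribute given a and given b.
    Both values are assumed to occur in `data`.
    """
    if a == b:
        return 0.0
    others = [k for k in range(len(data[0])) if k != j]
    if not others:  # single attribute: fall back to simple mismatch
        return 1.0
    total = sum(
        js_divergence(conditional_dist(data, j, a, k),
                      conditional_dist(data, j, b, k))
        for k in others
    )
    return total / len(others)


def association_distance(data, x, y):
    """Distance between observations: sum (assumption) over attributes."""
    return sum(value_dissimilarity(data, j, x[j], y[j]) for j in range(len(x)))


# Hypothetical toy data: colour and temperature are strongly associated,
# while the third attribute is unrelated to both.
data = [
    ("red", "hot", "x"), ("red", "hot", "y"),
    ("orange", "hot", "x"), ("orange", "hot", "y"),
    ("blue", "cold", "x"), ("blue", "cold", "y"),
]

print(hamming(data[0], data[2]), association_distance(data, data[0], data[2]))
print(hamming(data[0], data[4]), association_distance(data, data[0], data[4]))
```

In this toy data, "red" and "orange" induce the same conditional distributions on the other attributes, so the association-based distance between the first and third rows is 0.0 even though their Hamming distance is 1. A pairwise matrix built this way could then be passed to any PAM (k-medoids) routine that accepts precomputed distances.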
