Categorical Data Clustering Analysis Using Association-based Dissimilarity

Lee, Changki; Jung, Uk

doi:10.7469/JKSQM.2019.47.2.271

Journal of Korean Society for Quality Management (품질경영학회지)

Volume 47 Issue 2 / Pages 271-281 / 2019 / 1229-1889 (pISSN) / 2287-9005 (eISSN)

Korean Society for Quality Management (한국품질경영학회)

연관성 기반 비유사성을 활용한 범주형 자료 군집분석

Lee, Changki (College of Business Administration, Dongguk University);
Jung, Uk (College of Business Administration, Dongguk University)

이창기 (동국대학교 경영대학) ;
정욱 (동국대학교 경영대학)

Received : 2019.03.10
Accepted : 2019.03.25
Published : 2019.06.30

https://doi.org/10.7469/JKSQM.2019.47.2.271

Abstract

Purpose: This study proposes a more efficient distance measure for cluster analysis of categorical data, one that takes the relationships between categorical variables into account.

Methods: An association-based dissimilarity was used to calculate the distance between two categorical observations, and the resulting distances were applied to the PAM (partitioning around medoids) clustering algorithm to verify their effectiveness. The strength of association between two different categorical values is calculated as a mixture of the dissimilarities between the conditional probability distributions of the other categorical variables, given each of these two values. This method is particularly suitable for datasets whose categorical variables are highly correlated.

Results: Simulations on several real-life datasets showed that the proposed distance, which considers the relationships among the categorical variables, generally yielded better clustering performance than the Hamming distance. In addition, as the number of correlated variables increased, the difference in performance between the two clustering methods based on the different distance measures became statistically more significant.

Conclusion: Incorporating the relationships between categorical variables through the proposed method positively affected the results of cluster analysis.
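To make the contrast between the two distance measures concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the Jensen-Shannon divergence as the dissimilarity between conditional distributions and an unweighted sum over attributes; the abstract does not specify either choice. The function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def hamming(x, y):
    """Number of attributes on which two observations differ."""
    return sum(a != b for a, b in zip(x, y))


def conditional_dist(data, i, value, j):
    """Empirical P(A_j | A_i = value); assumes `value` occurs in column i."""
    counts = Counter(row[j] for row in data if row[i] == value)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    support = set(p) | set(q)
    m = {v: 0.5 * (p.get(v, 0.0) + q.get(v, 0.0)) for v in support}

    def kl(a):
        # KL divergence from `a` to the mixture m; zero-probability terms drop out.
        return sum(pa * log2(pa / m[v]) for v, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def value_dissimilarity(data, i, a, b):
    """Dissimilarity between values a and b of attribute i (assumed form):
    sum over the other attributes of the divergence between their
    conditional distributions given a and given b."""
    if a == b:
        return 0.0
    return sum(
        js_divergence(conditional_dist(data, i, a, j),
                      conditional_dist(data, i, b, j))
        for j in range(len(data[0])) if j != i
    )


def association_distance(data, x, y):
    """Distance between two observations: sum of per-attribute value dissimilarities."""
    return sum(value_dissimilarity(data, i, a, b)
               for i, (a, b) in enumerate(zip(x, y)))


# Hypothetical toy data: (color, shape, kind)
data = [
    ("red", "round", "apple"),
    ("red", "round", "apple"),
    ("crimson", "round", "apple"),
    ("crimson", "round", "apple"),
    ("green", "long", "bean"),
    ("green", "long", "bean"),
]

print(hamming(data[0], data[2]), association_distance(data, data[0], data[2]))
print(hamming(data[0], data[4]), association_distance(data, data[0], data[4]))
```

On this toy data, the first pair differs in one attribute under the Hamming distance but has zero association-based distance, because "red" and "crimson" co-occur with identical values of the other attributes. A full pairwise matrix of such distances could then be passed to any medoid-based clustering routine that accepts precomputed dissimilarities.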

References

Burnaby, T. P. 1970. "On a method for character weighting a similarity coefficient, employing the concept of information." Journal of the International Association for Mathematical Geology 2(1):25-38. https://doi.org/10.1007/BF02332078
Cha, S. H. 2007. "Comprehensive survey on distance/similarity measures between probability density functions." City 1(2):1.
Chakraborty, D. D. 2008. "Statistical decision theory: estimation, testing and selection." Investigacion Operacional 29(2):184-185.
Esposito, F., Malerba, D., Tamma, V., & Bock, H. H. 2000. "Classical resemblance measures." In Analysis of Symbolic Data: Exploratory methods for extracting statistical information from complex data, 15, 139-152.
Goodall, D. W. 1966. "A new similarity index based on probability." Biometrics, 882-907.
Hamming, R. W. 1950. "Error detecting and error correcting codes." Bell System technical journal 29(2):147-160. https://doi.org/10.1002/j.1538-7305.1950.tb00463.x
Jia, H., Cheung, Y. M., & Liu, J. 2016. "A new distance metric for unsupervised learning of categorical data." IEEE transactions on neural networks and learning systems 27(5):1065-1079. https://doi.org/10.1109/TNNLS.2015.2436432
Kaufman, L., & Rousseeuw, P. 1987. Clustering by means of medoids. North-Holland.
Kullback, S., & Leibler, R. A. 1951. "On information and sufficiency." The annals of mathematical statistics 22(1):79-86. https://doi.org/10.1214/aoms/1177729694
Le, S. Q., & Ho, T. B. 2005. "An association-based dissimilarity measure for categorical data." Pattern Recognition Letters 26(16):2549-2557. https://doi.org/10.1016/j.patrec.2005.06.002
Lim, Y. B., Kim, S. I., Lee, S. B., & Jang, D. H. 2016. "Literature Review on the Statistical Methods in KSQM for 50 Years." Journal of the Korean Society for Quality Management 44(2):221-244. https://doi.org/10.7469/JKSQM.2016.44.2.221
Lin, D. 1998. "An information-theoretic definition of similarity." In ICML 98, 296-304.
Lin, J. 1991. "Divergence measures based on the Shannon entropy." IEEE Transactions on Information theory, 37(1):145-151. https://doi.org/10.1109/18.61115
Mahalanobis, P. C. 1936. On the generalized distance in statistics. National Institute of Science of India.
Rand, W. M. 1971. "Objective criteria for the evaluation of clustering methods." Journal of the American Statistical association 66(336):846-850. https://doi.org/10.1080/01621459.1971.10482356
Seo, M. K., & Yun, W. Y. 2017. "Clustering-based Monitoring and Fault detection in Hot Strip Roughing Mill." Journal of the Korean Society for Quality Management 45(1):25-38. https://doi.org/10.7469/JKSQM.2017.45.1.025
Smirnov, E. S. 1968. "On exact methods in systematics." Systematic Biology 17(1):1-13. https://doi.org/10.1093/sysbio/17.1.1
Suh, C. J., Kim, H. T., Kim, J. H., & Kawk, Y. W. 2013. Introduction to Management Quality. 1st edition. Parkyong.
