Categorical Data Clustering Analysis Using Association-based Dissimilarity

Lee, Changki; Jung, Uk

doi:10.7469/JKSQM.2019.47.2.271

Journal of Korean Society for Quality Management (품질경영학회지)

Volume 47, Issue 2 / Pages 271-281 / 2019 / 1229-1889 (pISSN) / 2287-9005 (eISSN)

Korean Society for Quality Management (한국품질경영학회)

연관성 기반 비유사성을 활용한 범주형 자료 군집분석

Lee, Changki (College of Business Administration, Dongguk University)
Jung, Uk (College of Business Administration, Dongguk University)

Received: 2019.03.10
Reviewed: 2019.03.25
Published: 2019.06.30

https://doi.org/10.7469/JKSQM.2019.47.2.271

Abstract

Purpose: This study proposes a more effective distance measure for cluster analysis of categorical data, one that takes the relationships between categorical variables into account.

Methods: Association-based dissimilarity was used to calculate the distance between two categorical observations, and the resulting distances were supplied to the PAM (Partitioning Around Medoids) clustering algorithm to verify the measure's effectiveness. The strength of association between two categorical values is calculated as a mixture of dissimilarities between the conditional probability distributions of the other categorical variables, given each of the two values. The method is particularly suited to datasets whose categorical variables are highly correlated.

Results: Simulation results on several real-life datasets showed that the proposed distance, which accounts for relationships among the categorical variables, generally yielded better clustering performance than the Hamming distance. In addition, as the number of correlated variables increased, the difference in performance between the two distance measures became more statistically significant.

Conclusion: Incorporating the relationships between categorical variables through the proposed method improved the results of cluster analysis.
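The contrast between the Hamming distance and an association-based dissimilarity can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the Jensen-Shannon divergence as the dissimilarity between conditional distributions and a simple average as the "mixture", neither of which the abstract specifies. The function names and the toy dataset are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def conditional_dists(data, i, j):
    """Estimate P(A_j | A_i = v) from the data for every value v of attribute i."""
    counts = defaultdict(Counter)
    for row in data:
        counts[row[i]][row[j]] += 1
    return {
        v: {w: c / sum(cnt.values()) for w, c in cnt.items()}
        for v, cnt in counts.items()
    }


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs (assumed choice), in [0, 1]."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        return sum(pa * log2(pa / m[k]) for k, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def value_dissimilarity(data, i, a, b):
    """Dissimilarity between values a and b of attribute i: the average
    divergence between the conditional distributions of the other attributes.
    Both values must occur in the data."""
    if a == b:
        return 0.0
    others = [j for j in range(len(data[0])) if j != i]
    total = 0.0
    for j in others:
        cond = conditional_dists(data, i, j)
        total += js_divergence(cond[a], cond[b])
    return total / len(others)


def association_distance(data, x, y):
    """Distance between two observations: sum of per-attribute value dissimilarities."""
    return sum(value_dissimilarity(data, i, x[i], y[i]) for i in range(len(x)))


def hamming_distance(x, y):
    """Number of attributes on which the two observations differ."""
    return sum(xi != yi for xi, yi in zip(x, y))


# Hypothetical toy data: colour is strongly associated with shape and taste.
data = [
    ("red", "round", "sweet"),
    ("red", "round", "sweet"),
    ("pink", "round", "sweet"),
    ("pink", "round", "sweet"),
    ("green", "long", "sour"),
    ("green", "long", "sour"),
]
x = ("red", "round", "sweet")
y = ("pink", "round", "sweet")   # differs from x in colour only
w = ("green", "round", "sweet")  # also differs from x in colour only

print(hamming_distance(x, y), hamming_distance(x, w))
print(association_distance(data, x, y), association_distance(data, x, w))
```

In this toy case the Hamming distance treats both pairs identically, since each differs in one attribute. The association-based sketch separates them: "red" and "pink" co-occur with the same shapes and tastes, whereas "red" and "green" do not. A pairwise matrix built from such distances could then be passed to any medoid-based algorithm such as PAM, which only needs dissimilarities between observations.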
