
A New Similarity Measure for Categorical Attribute-Based Clustering  

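Kim, Min (Cognitive Robotics Center, Korea Institute of Science and Technology)
Jeon, Joo-Hyuk (Department of Computer Science, Korea Advanced Institute of Science and Technology)
Woo, Kyung-Gu (SW Advanced Research Lab, Samsung Advanced Institute of Technology, Samsung Electronics)
Kim, Myoung-Ho (Department of Computer Science, Korea Advanced Institute of Science and Technology)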
Abstract
Clustering is widely used in applications such as pattern recognition, image analysis, and market analysis. Two important factors that determine cluster quality are the similarity measure and the number of attributes. A similarity measure should be defined with respect to the data type. Existing similarity measures work well for numerical attribute values, but not when the data are described by categorical attributes, that is, when there is no inherent similarity measure between values. In high-dimensional spaces, conventional clustering algorithms tend to break down because the data points are sparse. To overcome this difficulty, the subspace clustering approach has been proposed; it is based on the observation that different clusters may exist in different subspaces. In this paper, we propose a new similarity measure for clustering high-dimensional categorical data. The measure is based on the idea that, in a good clustering, each cluster carries information that distinguishes it from the other clusters. We also try to capture the dependencies among attributes. This study is meaningful because no previous method has used both cluster-distinguishing information and attribute dependencies. Experimental results on real datasets show that the clusters obtained with the proposed similarity measure achieve satisfactory clustering accuracy.
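To make the difficulty with categorical attributes concrete, the following minimal sketch shows the conventional simple-matching dissimilarity, which only counts mismatched attributes. It is a baseline for illustration, not the measure proposed in the paper (which this abstract does not specify). The function name and the example records are hypothetical.

```python
from typing import Sequence


def simple_matching_dissimilarity(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the attributes on which two categorical records disagree."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical records with attributes (color, shape, size).
r1 = ("red", "round", "small")
r2 = ("red", "square", "small")
r3 = ("blue", "square", "large")

d12 = simple_matching_dissimilarity(r1, r2)  # records differ in shape only
d13 = simple_matching_dissimilarity(r1, r3)  # records differ in all three attributes
```

Every mismatch contributes the same amount, so this baseline uses neither cluster-specific information nor dependencies among attributes, the two ingredients the abstract says the proposed measure combines.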
Keywords
clustering; similarity measure; k-means clustering