[KSCI] Korea Science Citation Index Service

A Performance Comparison of Cluster Validity Indices based on K-means Algorithm

Shim, Yo-Sung (고려대학교 산업시스템정보공학과)
Chung, Ji-Won (고려대학교 산업시스템정보공학과)
Choi, In-Chan (고려대학교 산업시스템정보공학과)

Publication Information

Asia pacific journal of information systems / v.16, no.1, 2006 , pp. 127-144 More about this Journal

Abstract

The K-means algorithm is widely used at the initial stage of data analysis in data mining process, partly because of its low time complexity and the simplicity of practical implementation. Cluster validity indices are used along with the algorithm in order to determine the number of clusters as well as the clustering results of datasets. In this paper, we present a performance comparison of sixteen indices, which are selected from forty indices in literature, while considering their applicability to nonhierarchical clustering algorithms. Data sets used in the experiment are generated based on multivariate normal distribution. In particular, four error types including standardization, outlier generation, error perturbation, and noise dimension addition are considered in the comparison. Through the experiment the effects of varying number of points, attributes, and clusters on the performance are analyzed. The result of the simulation experiment shows that Calinski and Harabasz index performs the best through the all datasets and that Davis and Bouldin index becomes a strong competitor as the number of points increases in dataset.

Keywords

Data Mining; Cluster Analysis; Nonhierarchical Clustering; K-means; Cluster Validity Index;

Citations & Related Records

Reference

1	Calinski T. and Harabasz, J., 'A Dendrite Method for Cluster Analysis,'Communications in Statistics, Vol. 3, No. 1, 1974, pp. 1-27
2	Davies D.L. and Bouldin, D.W., 'A Cluster Separation measure,'IEEE Transactions on Pattern analysis and Machine Intelligence, Vol. PAMI 1, No. 2, 1979, pp. 224-227 DOI
3	Dimitriadou, E., Dolnicar, S. and Weingessel, A., 'An Examination of Indexes for Determining the Number of Clusters in Binary Data Sets,'Psychometrika, Vol. 67, No. 1, 2002
4	Edwards, A.W.F. and L. Cavalli Sforza, 'A Method for Cluster Analysis,'Biometrika, Vol. 56, 1965, pp. 362-375
5	Jain, A.K., Murty, M.N. and Flynn, P.J., 'Data Clustering: A Review,'ACM Computing Surveys, Vol. 31, No. 3, 1999
6	Johnson, S.C., 'Hierarchical Clustering Schemes,'Psychometrika, Vol. 32, 1967, pp. 241-254 DOI
7	Lingoes, J.C. and Cooper, T., 'PEP-I: A FORTRAN IV (G) program for Guttman-Lingoes Nonmetric Probability Clustering,'Behavioral Science, Vol. 16, 1971, pp. 259-261
8	Milligan, G.W. and Cooper, M.C., 'An Examination of Procedures for Determining the Number of Clusters in a Data Set,'Psychometrika, Vol. 50, No. 2, 1985, pp. 159-179 DOI
9	Ray, A.A., SAS user's guide: Statistics, Cary, North Carolina: SAS Institue, 1982
10	Scott, A.J. and Symons, M.J., 'Clustering Methods based on Likelihood Ratio Criteria,'Biometrics, Vol. 27, 1971, pp. 387- 397 DOI ScienceOn
11	Marriot, F.H.B., 'Practical Problems in a Method of Cluster Analysis,'Biometrics, Vol. 27, 1975, pp. 456-460
12	Ball, G.H. and Hall, D.J., 'ISODATA, A Novel Method of Data Analysis and Pattern Classification,'Menlo Park: Stanford Research Institute. (NTIS No. AD 699616), 1965
13	Bock H.H., 'On Tests Concerning the Existence of a Classification,'In First International Symposium on Data Analysis and Informatics, Vol. 2, 1977, pp. 449-464, Rocquencourt, France: IRIA
14	Wolfe, J.H., 'Pattern Clustering by Multivariate Mixture Analysis,'Multivariate Behavioral Research, Vol. 5, 1970, pp. 329-350 DOI
15	Halkidi, M. and Vazirgiannis, M., 'Clustering Validity Assessment: Finding the Optimal Partitioning of a Data Set,'Proceedings IEEE International Conference on Data Mining, 2001, pp. 187-194
16	Ratkowsky, D.A. and Lance, G.N., 'A Criterion for determining the number of groups in a classification,'Australian Computer Journal, Vol. 10, 1978, pp. 115-117
17	Berry, M.J.A. and Linoff, G.S., Mastering Data Mining - The Art and Science of Customer Relationship Management, John Wiley and Sons, Inc. 2000
18	Hubert, L.J. and Levin, J.R., 'A General Statistical Framework for Assessing Categorical Clustering in Free Recall,'Psychological Bulletin, Vol. 83, 1976, pp. 1072- 1080 DOI
19	MacQueen, J.B., 'Some Methods for Classification and Analysis of Multivariate Observations,'Proceedings of 5 th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, Vol. 1, 1967, pp. 281-297
20	Duda, R.O. and Hart, P.E., Pattern Classification and Scene Analysis, New York: Wiley, 1973
21	Ray, S. and Turi, R.H., 'Determination of Number of Clusters in k-means Clustering and Application in Colour Image Segmentation,'in Proceedings of the 4th International Conference on Advances in Pattern Recognition and Digital Techniques, 1999, pp. 137-143
22	Beale, E.M.L., Cluster Analysis, London: Scientific Control Systems, 1969
23	Bezdek, J.C. and Pal, N.R., 'Some New Indexes of Cluster Validity,'IEEE Transactions on Systems, Man, and Cybernetics-PART B: CYBERNETICS, Vol. 28, No. 3, 1998
24	Sneath, P.H.A., 'A Method for Testing the Distinctness of Clusters: A Test of the Disjunction of two Clusters in Euclidean Space as Measured by their Overlap,'Mathematical Geology, Vol. 9, 1977, pp. 123-143 DOI
25	Frey, T. and Groenewoud, H.V., 'A cluster Analysis of the D-squared Matrix of White Spruce Stands in Saskatchewan based on the Maximum Minimum Principle,'Journal of Ecology, Vol. 60, 1972, pp. 873-886 DOI ScienceOn
26	Milligan G.W., 'A Monte Carlo Study of Thirty Internal Criterion Measures for Cluster Analysis,'Psychometrika, Vol. 46, 1981, pp. 187-199 DOI
27	Day, N.E., 'Estimating the Components of a Mixture of Normal Distributions,'Biometrika, Vol. 56, 1969, pp. 463-474 DOI ScienceOn
28	McClain, J.O. and Rao, V.R., 'CLUSTISZ: A Program to Test for the Quality of Clustering of a Set of Objects,'Journal of Marketing Research, Vol. 12, 1975, pp. 456-460
29	Milligan, G.W., 'An Examination of the Effect of six Types of Error Perturbation on Fifteen Clustering Algorithms,'Psychometrika, Vol. 45, No. 3, 1980, pp. 325-342 DOI
30	Mountford, M.D., 'A Test for the Difference between clusters, In G.P. Patil, E.C. Pielou, and W.E. Waters(EDs.),'Statistical Ecology, Vol. 3, 1970, pp. 237-257, University Park, Pa.: Pennsylvania State University Press
31	Mojena, R., 'Hierarchical Grouping Methods and Stopping Rules: An Evaluation,'The Computer Journal, Vol. 20, 1977, pp. 359-363 DOI
32	Hartigan, J.A., Clustering Algorithms, New Work, Wiley, 1975
33	Baker, F.B. and Hubert, L.J., 'Measuring the Power of Hierarchical Cluster Analysis,'Journal of the American Statistical Association, Vol. 70, 1975, pp. 31-38 DOI
34	Forgy, E., 'Cluster Analysis of Multivariate Data: Effciency vs. Interpretability of Classifications,'Biometrics, Vol. 21, 1965, 768
35	Gnanadesikan, R., Kettenring, J.R. and Landwehr, J.M., 'Interpreting and Assessing the Results of Cluster Analyses,'Bulletin of the International Statistical Institute, Vol. 47, 1977, pp. 451-463
36	Rohlf, F.J., 'Methods of Comparing Classifications,'Annual Review of Ecology and Systematics, Vol. 5, 1974, pp. 101-113 DOI
37	Day, W.H.E., Complexity Theory: An Introduction for Practitioners of Classification, Clustering and Classification, P. Arabie and L. Hubert, Eds. World Scientific Publishing Co., Inc., River Edge, NJ., 1992
38	Friedman, H.P. and Rubin, J., 'On Some Invariant Criteria for Grouping Data,'Journal of the American Statistical Association, Vol. 62, 1967, pp. 1159-1178 DOI
39	Milligan, G.W., 'An Algorithm for Generating Artificial Test Clusters,'Psychometrika, Vol. 50, No. 1, 1985, pp. 123-127 DOI
40	Kurita, T., 'An Efficient Agglomerative Clustering Algorithm using a Heap,'Pattern Recognition, Vol. 24, No. 3, 1991, pp. 205-209 DOI ScienceOn

KSCI

A Performance Comparison of Cluster Validity Indices based on K-means Algorithm K-means 알고리즘 기반 클러스터링 인덱스 비교 연구

A Performance Comparison of Cluster Validity Indices based on K-means Algorithm