A Performance Comparison of Cluster Validity Indices based on K-means Algorithm  

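Shim, Yo-Sung (Department of Industrial Systems and Information Engineering, Korea University)
Chung, Ji-Won (Department of Industrial Systems and Information Engineering, Korea University)
Choi, In-Chan (Department of Industrial Systems and Information Engineering, Korea University)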
Publication Information
Asia Pacific Journal of Information Systems, Vol. 16, No. 1, 2006, pp. 127-144
Abstract
The K-means algorithm is widely used in the initial stage of data analysis in the data mining process, partly because of its low time complexity and ease of implementation. Cluster validity indices are used together with the algorithm to determine the number of clusters and to assess the clustering results. In this paper, we present a performance comparison of sixteen indices, selected from forty indices in the literature on the basis of their applicability to nonhierarchical clustering algorithms. The data sets used in the experiment are generated from multivariate normal distributions. In particular, four error types are considered in the comparison: standardization, outlier generation, error perturbation, and noise dimension addition. The experiment also analyzes how varying the number of points, attributes, and clusters affects performance. The simulation results show that the Calinski and Harabasz index performs best across all data sets and that the Davies and Bouldin index becomes a strong competitor as the number of points in the data set increases.
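To illustrate how a validity index can be paired with K-means to choose the number of clusters, the following is a minimal sketch rather than the authors' experimental code. It computes the Calinski and Harabasz index in its usual form, CH(k) = [B / (k - 1)] / [W / (n - k)], where B is the between-cluster and W the within-cluster sum of squares, and selects the k that maximizes it. The three cluster centers, unit spherical covariance, 50 points per cluster, 25 random restarts, and all function names are assumptions made for this example only.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(points, k, rng, n_iter=100):
    """Lloyd-style K-means with initial centers sampled from the data."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = mean(members)
    return labels

def group(points, labels):
    """Collect points into a dict keyed by cluster label."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    return clusters

def within_ss(points, labels):
    """Within-cluster sum of squares W."""
    total = 0.0
    for members in group(points, labels).values():
        c = mean(members)
        total += sum(sq_dist(p, c) for p in members)
    return total

def calinski_harabasz(points, labels):
    """CH = [B / (k - 1)] / [W / (n - k)]; requires 2 <= k < n and W > 0."""
    n = len(points)
    clusters = group(points, labels)
    k = len(clusters)
    grand = mean(points)
    between = sum(len(m) * sq_dist(mean(m), grand) for m in clusters.values())
    return (between / (k - 1)) / (within_ss(points, labels) / (n - k))

# Synthetic data: three spherical normal clusters (assumed setup).
rng = random.Random(0)
true_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
data = [[rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)]
        for cx, cy in true_centers for _ in range(50)]

# For each k, keep the restart with the smallest W, then score it with CH.
scores = {}
for k in range(2, 7):
    runs = [kmeans(data, k, rng) for _ in range(25)]
    best = min(runs, key=lambda lab: within_ss(data, lab))
    scores[k] = calinski_harabasz(data, best)

best_k = max(scores, key=scores.get)
print("CH index by k:", {k: round(v, 1) for k, v in scores.items()})
print("selected k:", best_k)
```

Random restarts are used here because a single K-means run can stop at a poor local optimum, which would distort the index value for that k. Other indices, such as the Davies and Bouldin index, could be substituted for `calinski_harabasz` in the same loop, with the selection rule changed to minimization where the index calls for it.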
Keywords
Data Mining; Cluster Analysis; Nonhierarchical Clustering; K-means; Cluster Validity Index
References
1 Calinski, T. and Harabasz, J., 'A Dendrite Method for Cluster Analysis,' Communications in Statistics, Vol. 3, No. 1, 1974, pp. 1-27
2 Davies, D.L. and Bouldin, D.W., 'A Cluster Separation Measure,' IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-1, No. 2, 1979, pp. 224-227
3 Dimitriadou, E., Dolnicar, S. and Weingessel, A., 'An Examination of Indexes for Determining the Number of Clusters in Binary Data Sets,' Psychometrika, Vol. 67, No. 1, 2002
4 Edwards, A.W.F. and Cavalli-Sforza, L., 'A Method for Cluster Analysis,' Biometrika, Vol. 56, 1965, pp. 362-375
5 Jain, A.K., Murty, M.N. and Flynn, P.J., 'Data Clustering: A Review,' ACM Computing Surveys, Vol. 31, No. 3, 1999
6 Johnson, S.C., 'Hierarchical Clustering Schemes,' Psychometrika, Vol. 32, 1967, pp. 241-254
7 Lingoes, J.C. and Cooper, T., 'PEP-I: A FORTRAN IV (G) Program for Guttman-Lingoes Nonmetric Probability Clustering,' Behavioral Science, Vol. 16, 1971, pp. 259-261
8 Milligan, G.W. and Cooper, M.C., 'An Examination of Procedures for Determining the Number of Clusters in a Data Set,' Psychometrika, Vol. 50, No. 2, 1985, pp. 159-179
9 Ray, A.A., SAS User's Guide: Statistics, Cary, North Carolina: SAS Institute, 1982
10 Scott, A.J. and Symons, M.J., 'Clustering Methods Based on Likelihood Ratio Criteria,' Biometrics, Vol. 27, 1971, pp. 387-397
11 Marriot, F.H.B., 'Practical Problems in a Method of Cluster Analysis,' Biometrics, Vol. 27, 1975, pp. 456-460
12 Ball, G.H. and Hall, D.J., 'ISODATA, A Novel Method of Data Analysis and Pattern Classification,' Menlo Park: Stanford Research Institute (NTIS No. AD 699616), 1965
13 Bock, H.H., 'On Tests Concerning the Existence of a Classification,' in First International Symposium on Data Analysis and Informatics, Vol. 2, 1977, pp. 449-464, Rocquencourt, France: IRIA
14 Wolfe, J.H., 'Pattern Clustering by Multivariate Mixture Analysis,' Multivariate Behavioral Research, Vol. 5, 1970, pp. 329-350
15 Halkidi, M. and Vazirgiannis, M., 'Clustering Validity Assessment: Finding the Optimal Partitioning of a Data Set,' Proceedings of the IEEE International Conference on Data Mining, 2001, pp. 187-194
16 Ratkowsky, D.A. and Lance, G.N., 'A Criterion for Determining the Number of Groups in a Classification,' Australian Computer Journal, Vol. 10, 1978, pp. 115-117
17 Berry, M.J.A. and Linoff, G.S., Mastering Data Mining: The Art and Science of Customer Relationship Management, John Wiley and Sons, Inc., 2000
18 Hubert, L.J. and Levin, J.R., 'A General Statistical Framework for Assessing Categorical Clustering in Free Recall,' Psychological Bulletin, Vol. 83, 1976, pp. 1072-1080
19 MacQueen, J.B., 'Some Methods for Classification and Analysis of Multivariate Observations,' Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley: University of California Press, Vol. 1, 1967, pp. 281-297
20 Duda, R.O. and Hart, P.E., Pattern Classification and Scene Analysis, New York: Wiley, 1973
21 Ray, S. and Turi, R.H., 'Determination of Number of Clusters in K-means Clustering and Application in Colour Image Segmentation,' in Proceedings of the 4th International Conference on Advances in Pattern Recognition and Digital Techniques, 1999, pp. 137-143
22 Beale, E.M.L., Cluster Analysis, London: Scientific Control Systems, 1969
23 Bezdek, J.C. and Pal, N.R., 'Some New Indexes of Cluster Validity,' IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, Vol. 28, No. 3, 1998
24 Sneath, P.H.A., 'A Method for Testing the Distinctness of Clusters: A Test of the Disjunction of Two Clusters in Euclidean Space as Measured by Their Overlap,' Mathematical Geology, Vol. 9, 1977, pp. 123-143
25 Frey, T. and Groenewoud, H.V., 'A Cluster Analysis of the D-squared Matrix of White Spruce Stands in Saskatchewan Based on the Maximum Minimum Principle,' Journal of Ecology, Vol. 60, 1972, pp. 873-886
26 Milligan, G.W., 'A Monte Carlo Study of Thirty Internal Criterion Measures for Cluster Analysis,' Psychometrika, Vol. 46, 1981, pp. 187-199
27 Day, N.E., 'Estimating the Components of a Mixture of Normal Distributions,' Biometrika, Vol. 56, 1969, pp. 463-474
28 McClain, J.O. and Rao, V.R., 'CLUSTISZ: A Program to Test for the Quality of Clustering of a Set of Objects,' Journal of Marketing Research, Vol. 12, 1975, pp. 456-460
29 Milligan, G.W., 'An Examination of the Effect of Six Types of Error Perturbation on Fifteen Clustering Algorithms,' Psychometrika, Vol. 45, No. 3, 1980, pp. 325-342
30 Mountford, M.D., 'A Test for the Difference between Clusters,' in G.P. Patil, E.C. Pielou, and W.E. Waters (Eds.), Statistical Ecology, Vol. 3, 1970, pp. 237-257, University Park, Pa.: Pennsylvania State University Press
31 Mojena, R., 'Hierarchical Grouping Methods and Stopping Rules: An Evaluation,' The Computer Journal, Vol. 20, 1977, pp. 359-363
32 Hartigan, J.A., Clustering Algorithms, New York: Wiley, 1975
33 Baker, F.B. and Hubert, L.J., 'Measuring the Power of Hierarchical Cluster Analysis,' Journal of the American Statistical Association, Vol. 70, 1975, pp. 31-38
34 Forgy, E., 'Cluster Analysis of Multivariate Data: Efficiency vs. Interpretability of Classifications,' Biometrics, Vol. 21, 1965, p. 768
35 Gnanadesikan, R., Kettenring, J.R. and Landwehr, J.M., 'Interpreting and Assessing the Results of Cluster Analyses,' Bulletin of the International Statistical Institute, Vol. 47, 1977, pp. 451-463
36 Rohlf, F.J., 'Methods of Comparing Classifications,' Annual Review of Ecology and Systematics, Vol. 5, 1974, pp. 101-113
37 Day, W.H.E., 'Complexity Theory: An Introduction for Practitioners of Classification,' in P. Arabie and L. Hubert (Eds.), Clustering and Classification, River Edge, NJ: World Scientific Publishing Co., Inc., 1992
38 Friedman, H.P. and Rubin, J., 'On Some Invariant Criteria for Grouping Data,' Journal of the American Statistical Association, Vol. 62, 1967, pp. 1159-1178
39 Milligan, G.W., 'An Algorithm for Generating Artificial Test Clusters,' Psychometrika, Vol. 50, No. 1, 1985, pp. 123-127
40 Kurita, T., 'An Efficient Agglomerative Clustering Algorithm Using a Heap,' Pattern Recognition, Vol. 24, No. 3, 1991, pp. 205-209