A Comparative Study of Clustering Indices Based on the K-means Algorithm

A Performance Comparison of Cluster Validity Indices based on K-means Algorithm

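  • 심요성 (Department of Industrial Systems and Information Engineering, Korea University)
  • 정지원 (Department of Industrial Systems and Information Engineering, Korea University)
  • 최인찬 (Department of Industrial Systems and Information Engineering, Korea University)
  • Published: 2006.03.31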

Abstract

The K-means algorithm is widely used in the initial stage of data analysis in the data mining process, partly because of its low time complexity and simple implementation. Cluster validity indices are used together with the algorithm to determine the number of clusters and to evaluate the clustering results for a dataset. In this paper, we present a performance comparison of sixteen indices, selected from forty indices in the literature on the basis of their applicability to nonhierarchical clustering algorithms. The datasets used in the experiment are generated from multivariate normal distributions. In particular, four error types are considered in the comparison: standardization, outlier generation, error perturbation, and noise dimension addition. The experiment also analyzes how varying the number of points, attributes, and clusters affects performance. The simulation results show that the Calinski and Harabasz index performs best across all datasets and that the Davies and Bouldin index becomes a strong competitor as the number of points in the dataset increases.
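As a rough illustration of the kind of experiment the abstract describes, the sketch below generates points from spherical normal clusters, runs a basic K-means for several values of k, and picks the k that maximizes the Calinski and Harabasz index. It is not the authors' procedure: the cluster centers, sample sizes, identity covariance, random restarts, and all function names (`make_clusters`, `kmeans`, `best_kmeans`, `calinski_harabasz`) are assumptions made for this example, and none of the four error types is simulated.

```python
import random

def make_clusters(centers, n_per_cluster, sd=1.0, seed=0):
    # Assumption: independent normal coordinates (spherical covariance).
    rng = random.Random(seed)
    return [[rng.gauss(c, sd) for c in center]
            for center in centers for _ in range(n_per_cluster)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(points, k, seed=0, iters=100):
    # Plain Lloyd iterations started from k randomly sampled points.
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = mean(members)
    return labels, centroids

def best_kmeans(points, k, restarts=20, seed=0):
    # Keep the restart with the smallest within-cluster sum of squares.
    best = None
    for r in range(restarts):
        labels, centroids = kmeans(points, k, seed=seed + r)
        wss = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
        if best is None or wss < best[0]:
            best = (wss, labels)
    return best[1]

def calinski_harabasz(points, labels, k):
    # CH = [BSS / (k - 1)] / [WSS / (n - k)]; larger is better.
    n = len(points)
    overall = mean(points)
    wss = bss = 0.0
    for j in range(k):
        members = [p for p, l in zip(points, labels) if l == j]
        if not members:
            continue
        c = mean(members)
        wss += sum(sq_dist(p, c) for p in members)
        bss += len(members) * sq_dist(c, overall)
    return (bss / (k - 1)) / (wss / (n - k))

# Three well-separated 2-D clusters (hypothetical test data).
points = make_clusters([(0, 0), (10, 0), (0, 10)], n_per_cluster=30)
scores = {k: calinski_harabasz(points, best_kmeans(points, k), k)
          for k in range(2, 7)}
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

The index rewards partitions with large between-cluster and small within-cluster dispersion, scaled by degrees of freedom, so the estimated number of clusters is simply the k with the highest score. Other indices, such as Davies and Bouldin, could be swapped in by replacing `calinski_harabasz` and selecting the minimum instead.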

References

  1. Baker, F.B. and Hubert, L.J., 'Measuring the Power of Hierarchical Cluster Analysis,' Journal of the American Statistical Association, Vol. 70, 1975, pp. 31-38 https://doi.org/10.2307/2285371
  2. Ball, G.H. and Hall, D.J., 'ISODATA, A Novel Method of Data Analysis and Pattern Classification,' Menlo Park: Stanford Research Institute. (NTIS No. AD 699616), 1965
  3. Beale, E.M.L., Cluster Analysis, London: Scientific Control Systems, 1969
  4. Berry, M.J.A. and Linoff, G.S., Mastering Data Mining - The Art and Science of Customer Relationship Management, John Wiley and Sons, Inc., 2000
  5. Bezdek, J.C. and Pal, N.R., 'Some New Indexes of Cluster Validity,' IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, Vol. 28, No. 3, 1998
  6. Bock, H.H., 'On Tests Concerning the Existence of a Classification,' In First International Symposium on Data Analysis and Informatics, Vol. 2, 1977, pp. 449-464, Rocquencourt, France: IRIA
  7. Calinski, T. and Harabasz, J., 'A Dendrite Method for Cluster Analysis,' Communications in Statistics, Vol. 3, No. 1, 1974, pp. 1-27
  8. Davies, D.L. and Bouldin, D.W., 'A Cluster Separation Measure,' IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-1, No. 2, 1979, pp. 224-227 https://doi.org/10.1109/TPAMI.1979.4766909
  9. Day, N.E., 'Estimating the Components of a Mixture of Normal Distributions,' Biometrika, Vol. 56, 1969, pp. 463-474 https://doi.org/10.1093/biomet/56.3.463
  10. Day, W.H.E., 'Complexity Theory: An Introduction for Practitioners of Classification,' in Clustering and Classification, P. Arabie and L. Hubert (Eds.), World Scientific Publishing Co., Inc., River Edge, NJ, 1992
  11. Dimitriadou, E., Dolnicar, S. and Weingessel, A., 'An Examination of Indexes for Determining the Number of Clusters in Binary Data Sets,' Psychometrika, Vol. 67, No. 1, 2002
  12. Duda, R.O. and Hart, P.E., Pattern Classification and Scene Analysis, New York: Wiley, 1973
  13. Edwards, A.W.F. and Cavalli-Sforza, L., 'A Method for Cluster Analysis,' Biometrics, Vol. 21, 1965, pp. 362-375
  14. Forgy, E., 'Cluster Analysis of Multivariate Data: Efficiency vs. Interpretability of Classifications,' Biometrics, Vol. 21, 1965, p. 768
  15. Frey, T. and Groenewoud, H.V., 'A Cluster Analysis of the D-squared Matrix of White Spruce Stands in Saskatchewan based on the Maximum Minimum Principle,' Journal of Ecology, Vol. 60, 1972, pp. 873-886 https://doi.org/10.2307/2258571
  16. Friedman, H.P. and Rubin, J., 'On Some Invariant Criteria for Grouping Data,' Journal of the American Statistical Association, Vol. 62, 1967, pp. 1159-1178 https://doi.org/10.2307/2283767
  17. Gnanadesikan, R., Kettenring, J.R. and Landwehr, J.M., 'Interpreting and Assessing the Results of Cluster Analyses,' Bulletin of the International Statistical Institute, Vol. 47, 1977, pp. 451-463
  18. Halkidi, M. and Vazirgiannis, M., 'Clustering Validity Assessment: Finding the Optimal Partitioning of a Data Set,' Proceedings IEEE International Conference on Data Mining, 2001, pp. 187-194
  19. Hartigan, J.A., Clustering Algorithms, New York: Wiley, 1975
  20. Hubert, L.J. and Levin, J.R., 'A General Statistical Framework for Assessing Categorical Clustering in Free Recall,' Psychological Bulletin, Vol. 83, 1976, pp. 1072-1080 https://doi.org/10.1037/0033-2909.83.6.1072
  21. Jain, A.K., Murty, M.N. and Flynn, P.J., 'Data Clustering: A Review,' ACM Computing Surveys, Vol. 31, No. 3, 1999
  22. Johnson, S.C., 'Hierarchical Clustering Schemes,' Psychometrika, Vol. 32, 1967, pp. 241-254 https://doi.org/10.1007/BF02289588
  23. Kurita, T., 'An Efficient Agglomerative Clustering Algorithm using a Heap,' Pattern Recognition, Vol. 24, No. 3, 1991, pp. 205-209 https://doi.org/10.1016/0031-3203(91)90062-A
  24. Lingoes, J.C. and Cooper, T., 'PEP-I: A FORTRAN IV (G) program for Guttman-Lingoes Nonmetric Probability Clustering,' Behavioral Science, Vol. 16, 1971, pp. 259-261
  25. MacQueen, J.B., 'Some Methods for Classification and Analysis of Multivariate Observations,' Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, Vol. 1, 1967, pp. 281-297
  26. Marriott, F.H.C., 'Practical Problems in a Method of Cluster Analysis,' Biometrics, Vol. 27, 1971, pp. 501-514
  27. McClain, J.O. and Rao, V.R., 'CLUSTISZ: A Program to Test for the Quality of Clustering of a Set of Objects,' Journal of Marketing Research, Vol. 12, 1975, pp. 456-460
  28. Milligan, G.W. and Cooper, M.C., 'An Examination of Procedures for Determining the Number of Clusters in a Data Set,' Psychometrika, Vol. 50, No. 2, 1985, pp. 159-179 https://doi.org/10.1007/BF02294245
  29. Milligan, G.W., 'A Monte Carlo Study of Thirty Internal Criterion Measures for Cluster Analysis,' Psychometrika, Vol. 46, 1981, pp. 187-199 https://doi.org/10.1007/BF02293899
  30. Milligan, G.W., 'An Algorithm for Generating Artificial Test Clusters,' Psychometrika, Vol. 50, No. 1, 1985, pp. 123-127 https://doi.org/10.1007/BF02294153
  31. Milligan, G.W., 'An Examination of the Effect of Six Types of Error Perturbation on Fifteen Clustering Algorithms,' Psychometrika, Vol. 45, No. 3, 1980, pp. 325-342 https://doi.org/10.1007/BF02293907
  32. Mojena, R., 'Hierarchical Grouping Methods and Stopping Rules: An Evaluation,' The Computer Journal, Vol. 20, 1977, pp. 359-363 https://doi.org/10.1093/comjnl/20.4.359
  33. Mountford, M.D., 'A Test for the Difference between Clusters,' in G.P. Patil, E.C. Pielou, and W.E. Waters (Eds.), Statistical Ecology, Vol. 3, 1970, pp. 237-257, University Park, Pa.: Pennsylvania State University Press
  34. Ratkowsky, D.A. and Lance, G.N., 'A Criterion for Determining the Number of Groups in a Classification,' Australian Computer Journal, Vol. 10, 1978, pp. 115-117
  35. Ray, A.A., SAS User's Guide: Statistics, Cary, North Carolina: SAS Institute, 1982
  36. Ray, S. and Turi, R.H., 'Determination of Number of Clusters in k-means Clustering and Application in Colour Image Segmentation,' in Proceedings of the 4th International Conference on Advances in Pattern Recognition and Digital Techniques, 1999, pp. 137-143
  37. Rohlf, F.J., 'Methods of Comparing Classifications,' Annual Review of Ecology and Systematics, Vol. 5, 1974, pp. 101-113 https://doi.org/10.1146/annurev.es.05.110174.000533
  38. Scott, A.J. and Symons, M.J., 'Clustering Methods based on Likelihood Ratio Criteria,' Biometrics, Vol. 27, 1971, pp. 387-397 https://doi.org/10.2307/2529003
  39. Sneath, P.H.A., 'A Method for Testing the Distinctness of Clusters: A Test of the Disjunction of Two Clusters in Euclidean Space as Measured by their Overlap,' Mathematical Geology, Vol. 9, 1977, pp. 123-143 https://doi.org/10.1007/BF02312508
  40. Wolfe, J.H., 'Pattern Clustering by Multivariate Mixture Analysis,' Multivariate Behavioral Research, Vol. 5, 1970, pp. 329-350 https://doi.org/10.1207/s15327906mbr0503_6