Empirical Comparisons of Clustering Algorithms using Silhouette Information

  • Jun, Sung-Hae (Department of Bioinformatics & Statistics, Cheongju University)
  • Lee, Seung-Joo (Department of Bioinformatics & Statistics, Cheongju University)
  • Received : 2009.08.30
  • Accepted : 2010.01.10
  • Published : 2010.03.25

Abstract

Clustering algorithms are used in many fields. When a data set must be grouped into clusters, a number of algorithms based on similarity or distance measures are candidates, and most clustering work relies on either hierarchical or non-hierarchical algorithms. In practice, researchers choose among these algorithms case by case and decide which method is appropriate subjectively, based on their prior knowledge. To address this subjectivity, this paper makes empirical comparisons of popular hierarchical and non-hierarchical clustering algorithms using the silhouette measure. Silhouette information is used to evaluate clustering results, such as the number of clusters and the cluster variance. The comparison is verified by experiments on data sets from the UCI Machine Learning Repository. As a result, clustering algorithms can be selected and used efficiently and objectively.
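
The kind of comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' code. It assumes Rousseeuw's standard definition of the silhouette width, s(i) = (b(i) - a(i)) / max(a(i), b(i)), with Euclidean distance. The toy points and the two candidate partitions are invented for illustration and stand in for the outputs of any two clustering algorithms.

```python
from math import dist
from statistics import mean


def silhouette_widths(points, labels):
    """Return the silhouette width s(i) of every point (Rousseeuw's definition)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Singleton cluster: s(i) is set to 0 by convention.
            widths.append(0.0)
            continue
        # a(i): mean distance to the other members of the same cluster.
        a = mean(dist(points[i], points[j]) for j in own)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            mean(dist(points[i], points[j]) for j in members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


# Hypothetical toy data: two visibly separated groups in the plane.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]

# Hypothetical label vectors, as two different algorithms might produce them.
candidates = {
    "partition A": [0, 0, 0, 1, 1, 1],
    "partition B": [0, 0, 1, 1, 1, 0],
}

# Average silhouette width per partition; higher means tighter,
# better-separated clusters.
scores = {name: mean(silhouette_widths(points, labs))
          for name, labs in candidates.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: average silhouette width = {score:.3f}")
print("preferred:", best)
```

The same average can be computed for partitions with different numbers of clusters, so it gives one criterion that does not depend on the analyst's prior knowledge of the data.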
