A Study on Optimizing the Number of Clusters using External Cluster Relationship Criterion

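Korean title: 외부 군집 연관 기준 정보를 이용한 군집수 최적화 (Optimizing the Number of Clusters Using External Cluster Relationship Criterion Information)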

  • 이현진 (Department of Computer and Information Communication, Korea Cyber University)
  • 지태창 (Department of Computer Science, Yonsei University)
  • Received : 2011.09.02
  • Accepted : 2011.09.16
  • Published : 2011.09.30

Abstract

k-means is one of the most popular clustering algorithms because it is simple and fast, but it requires the number of clusters k to be chosen in advance, and the right value is usually unknown. The choice of k matters because the clustering result changes with it. In this paper, we present a novel algorithm that determines the number of clusters dynamically using an external cluster relationship criterion, a metric for evaluating the quality of a clustering result. Experimental results show that our algorithm estimates the number of clusters more accurately than other methods.

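Clustering partitions the given data in order to automatically discover the meaning hidden in it. k-means is one of the simple and fast clustering algorithms. The number of clusters k is a very important factor in clustering, and the clustering result varies with the value of k. This paper proposes a method that determines the optimal number of clusters by combining repeated runs of k-means with an external cluster relationship criterion that evaluates cluster quality. In the experiments, the proposed method outperformed existing methods in terms of the accuracy of the estimated number of clusters.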
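The general idea described in the abstract, running k-means repeatedly for different values of k and keeping the k whose clustering scores best under a quality criterion, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' algorithm: the abstract does not define the external cluster relationship criterion, so the mean silhouette coefficient is used here as a stand-in score. The function names (`farthest_first`, `kmeans`, `silhouette`, `choose_k`) and the deterministic farthest-first initialization are hypothetical choices made so that the example is reproducible.

```python
import math


def farthest_first(points, k):
    """Deterministic initialization (an assumption of this sketch):
    start at the first point, then repeatedly add the point that is
    farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, iters=100):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = farthest_first(points, k)
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient in [-1, 1]; a stand-in quality score,
    NOT the paper's external cluster relationship criterion."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for m, other in clusters.items()
            if m != l
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def choose_k(points, k_max):
    """Run k-means for k = 2..k_max and keep the best-scoring k."""
    best_k, best_score = None, -math.inf
    for k in range(2, k_max + 1):
        score = silhouette(points, kmeans(points, k))
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```

Calling `choose_k(points, k_max)` on a list of coordinate tuples returns the candidate k with the highest score. Any other cluster-quality measure could replace `silhouette` without changing the surrounding loop, which is the part that corresponds to the approach outlined in the abstract.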
