Incremental Clustering Algorithm by Modulating Vigilance Parameter Dynamically

  • 신광철 (School of Computer Engineering, Chung-Ang University);
  • 한상용 (School of Computer Engineering, Chung-Ang University)
  • Published: 2003.12.01

Abstract

This paper proposes a new clustering algorithm that can categorize large document collections incrementally. It draws on two existing methods: the spherical k-means (SKM) algorithm, which clusters large, high-dimensional document collections, and the fuzzy ART (adaptive resonance theory) neural network, which performs clustering incrementally. Specifically, it combines the vector space model and concept vectors of SKM with the vigilance parameter of fuzzy ART. Besides supporting incremental clustering, the proposed algorithm determines an appropriate number of clusters automatically and resolves the overfitting caused by outliers and noise. In experiments on the CLASSIC3 data set, it also improved the objective function value, which measures cluster coherence and is used to evaluate the quality of the resulting clusters, by 8.04% on average over SKM.
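The combination described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state how the vigilance parameter is modulated, so the sketch holds it fixed, and the function names (`normalize`, `cosine`, `incremental_cluster`, `coherence`) are hypothetical. It assumes non-negative term-weight vectors and a vigilance value in (0, 1], and it uses the usual SKM-style objective (the sum of cosine similarities between each document and the concept vector of its cluster) as the coherence measure.

```python
import math


def normalize(vec):
    """Scale a vector to unit Euclidean length (the 'spherical' part of SKM)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(u, v):
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(a * b for a, b in zip(u, v))


def incremental_cluster(docs, vigilance):
    """Assign documents one at a time. A new cluster is opened whenever no
    existing concept vector has similarity >= `vigilance` to the document.

    Assumption of this sketch: `vigilance` stays fixed; the dynamic
    modulation named in the paper's title is not reproduced here.
    """
    sums = []      # running sum of member vectors, one per cluster
    concepts = []  # unit-length concept vector, one per cluster
    labels = []    # cluster index assigned to each document, in input order
    for doc in docs:
        x = normalize(doc)
        best, best_sim = -1, -1.0
        for k, c in enumerate(concepts):
            sim = cosine(x, c)
            if sim > best_sim:
                best, best_sim = k, sim
        if best == -1 or best_sim < vigilance:
            # No cluster is similar enough: the document seeds a new one.
            sums.append(list(x))
            concepts.append(x)
            labels.append(len(concepts) - 1)
        else:
            # Join the best cluster and refresh its concept vector.
            sums[best] = [s + xi for s, xi in zip(sums[best], x)]
            concepts[best] = normalize(sums[best])
            labels.append(best)
    return labels, concepts


def coherence(docs, labels, concepts):
    """SKM-style objective: total cosine similarity between each document
    and the concept vector of its cluster (higher means tighter clusters)."""
    return sum(cosine(normalize(d), concepts[k]) for d, k in zip(docs, labels))


docs = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0.8, 0.2]]
labels, concepts = incremental_cluster(docs, vigilance=0.8)
score = coherence(docs, labels, concepts)
```

In this toy input the first two and the last two vectors end up in the same clusters at a vigilance of 0.8, while a much stricter value such as 0.999 places every vector in its own cluster. That is the sense in which the vigilance parameter, rather than a preset k, controls the number of clusters in the sketch.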
