Development of a Clustering Model for Automatic Knowledge Classification

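  • 정영미 (Department of Library and Information Science, Yonsei University)
  • 이재윤 (Department of Library and Information Science, Yonsei University)
  • Published: 2001.06.01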

Abstract

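This study set out to propose an optimal clustering model for the automatic classification of document-based knowledge. For the clustering experiments, two test collections were built, one of newspaper articles and one of journal article abstracts, and a classification performance measure, WACS, was developed. The term sets used as classification features were generated by applying various feature reduction criteria, and various term weights were used. The cosine and Jaccard coefficients were applied as similarity coefficients, and the clustering algorithms were complete linkage, a hierarchical method, and K-means, a non-hierarchical method. In the experiments, performance was better on the newspaper full-text collection, and complete linkage outperformed K-means. Applying inverse document frequency had a positive effect in complete linkage clustering but not in K-means clustering. Using only 7.66% of all classification features did not degrade performance much, and in K-means clustering it actually improved performance.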

The purpose of this study is to develop a document clustering model for the automatic classification of knowledge. Two test collections, one of newspaper article texts and one of journal article abstracts, were built for the clustering experiments. Various feature reduction criteria and term weighting methods were applied to the term sets of the test collections, and the cosine and Jaccard coefficients were used as similarity measures. The performance of the complete linkage and K-means clustering algorithms was compared across different feature selection methods and term weights. Complete linkage clustering was found to outperform the K-means algorithm, and reducing the feature set to roughly 10% of the full feature set did not lower document clustering performance to any significant extent.
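
The two similarity measures named in the abstract can be illustrated with a minimal Python sketch. The record does not give the paper's exact formulas, so everything here is an assumption: the function names are hypothetical, the inverse document frequency variant is `log(N/df) + 1`, and the Jaccard coefficient is taken in its extended (Tanimoto) form for weighted term vectors.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Turn tokenized documents into sparse {term: weight} vectors.

    Assumed weighting: raw term frequency times log(N / df) + 1.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / count) + 1.0 for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in docs
    ]


def _dot(u, v):
    """Inner product of two sparse vectors stored as dicts."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())


def cosine(u, v):
    """Cosine coefficient: dot(u, v) / (|u| * |v|)."""
    norm = math.sqrt(_dot(u, u)) * math.sqrt(_dot(v, v))
    return _dot(u, v) / norm if norm else 0.0


def jaccard(u, v):
    """Extended Jaccard: dot(u, v) / (|u|^2 + |v|^2 - dot(u, v))."""
    shared = _dot(u, v)
    denom = _dot(u, u) + _dot(v, v) - shared
    return shared / denom if denom else 0.0


# Toy documents, invented purely for illustration.
docs = [
    ["stock", "market", "bank"],
    ["stock", "market", "trade"],
    ["soccer", "goal", "match"],
]
vecs = tf_idf_vectors(docs)
print(cosine(vecs[0], vecs[1]), jaccard(vecs[0], vecs[1]))
print(cosine(vecs[0], vecs[2]), jaccard(vecs[0], vecs[2]))
```

For non-negative weights the extended Jaccard value never exceeds the cosine value for the same pair, so the two measures can rank document pairs similarly while differing in scale. A pairwise similarity matrix built this way could then be passed to either clustering algorithm compared in the study.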
