http://dx.doi.org/10.3745/KIPSTB.2007.14-B.7.513

Determining the number of Clusters in On-Line Document Clustering Algorithm  

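Jee, Tae-Chang (Department of Computer Science, Yonsei University)
Lee, Hyun-Jin (Division of Computer and Information Communication, Korea Cyber University)
Lee, Yill-Byung (Department of Computer Science, Yonsei University)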
Abstract
Clustering partitions given data and automatically uncovers the meaning hidden in it. It analyzes data that are difficult for people to examine in detail and forms clusters of items with similar characteristics. An on-line document clustering system, which groups similar documents from the results of a search engine, aims to make information retrieval more convenient. Document clustering is performed automatically, without human intervention, so the number of clusters, which affects the clustering result, should also be determined automatically. In addition, an on-line system must guarantee a fast response time. This paper proposes a method that determines the number of clusters automatically from geometrical information. The proposed method consists of two stages. In the first stage, the cluster centers are projected onto a low-dimensional plane; in the second stage, clusters are merged according to the distances between their centers in that plane. Experiments on real data showed that clustering performance improved and that the response time is suitable for an on-line setting.
Keywords
On-Line Document; Document Clustering; Optimizing the Number of Clusters; Determining the Number of Clusters; K-Means Algorithm; Multi-Dimensional Scaling;
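The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the projection or the merge rule, so the sketch assumes classical (Torgerson) multi-dimensional scaling for the projection, Euclidean distances between centers, and a hypothetical fixed `threshold` with single-link merging. All function names are illustrative.

```python
import math


def pairwise_distances(points):
    """Euclidean distance matrix between cluster centers."""
    return [[math.dist(p, q) for q in points] for p in points]


def classical_mds_2d(dist, iters=200):
    """Stage 1 (assumed): classical MDS onto a 2-D plane.

    Uses power iteration with deflation on the double-centered Gram
    matrix; assumes Euclidean input distances so that matrix is PSD.
    """
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    total_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0, 0.0] for _ in range(n)]
    for axis in range(2):
        v = [math.sin(i + 1.0 + axis) for i in range(n)]  # deterministic start
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
        if lam <= 1e-12:
            break  # no variance left; remaining axis stays zero
        scale = math.sqrt(lam)
        for i in range(n):
            coords[i][axis] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords


def merge_close_centers(coords, threshold):
    """Stage 2 (assumed rule): single-link merge of centers closer than threshold."""
    n = len(coords)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < threshold:
                parent[find(j)] = find(i)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical centers, e.g. from a K-Means run with a deliberately large k.
centers = [(0, 0, 0), (0.1, 0, 0), (5, 5, 0), (5.1, 5, 0), (10, 0, 0)]
plane = classical_mds_2d(pairwise_distances(centers))
groups = merge_close_centers(plane, threshold=1.0)
print(len(groups), groups)
```

The number of groups returned by `merge_close_centers` plays the role of the estimated number of clusters. Because only the cluster centers are projected, the cost depends on the initial number of clusters rather than the number of documents, which is in keeping with the fast response time the abstract asks of an on-line system.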