http://dx.doi.org/10.3745/KIPSTD.2003.10D.1.057

An Effective Incremental Text Clustering Method for the Large Document Database

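Kang, Dong-Hyuk (Research Institute, Netville Co., Ltd.)
Joo, Kil-Hong (Department of Computer Science, Graduate School, Yonsei University)
Lee, Won-Suk (Department of Computer Science, Yonsei University)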

Publication Information

The KIPS Transactions: Part D / v.10D, no.1, 2003, pp. 57-66

Abstract

With the growth of the Internet and of computing, the amount of information available online is increasing rapidly, and most of it is managed in document form. Research into methods for managing large numbers of documents effectively is therefore necessary. Document clustering groups documents by subject, classifying a set of documents according to the similarity among them. It can therefore be used for browsing and searching documents, and it can increase search accuracy. This paper proposes an efficient incremental clustering method for a document set that grows gradually. The incremental document clustering algorithm assigns a set of new documents to the legacy clusters that have been identified in advance. In addition, to improve the correctness of the clustering, a stop-word removal method is proposed, and the weight of each word is calculated by the proposed TF $\times$ NIDF function.
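The incremental assignment step described in the abstract might be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's algorithm: the abstract does not define NIDF, so a standard smoothed TF-IDF weight stands in for the proposed TF $\times$ NIDF function, and the stop-word list, the cosine similarity measure, the `threshold` value, and the `IncrementalClusterer` class are all hypothetical choices made for the example.

```python
import math
from collections import Counter

# Illustrative stop-word list (assumption; the paper extracts its own).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for"}


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


class IncrementalClusterer:
    """Assigns each new document to the most similar existing cluster,
    or opens a new cluster when no cluster is similar enough."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold   # hypothetical similarity cutoff
        self.doc_freq = Counter()    # number of documents containing each term
        self.n_docs = 0
        self.centroids = []          # summed term-weight vector per cluster
        self.members = []            # document ids per cluster

    def _vectorize(self, text):
        # Remove stop words, then weight terms with smoothed TF-IDF
        # (a stand-in for the paper's TF x NIDF, which is not defined here).
        tokens = [w for w in text.lower().split() if w not in STOP_WORDS]
        self.n_docs += 1
        self.doc_freq.update(set(tokens))
        tf = Counter(tokens)
        return {
            t: c * (math.log((1 + self.n_docs) / (1 + self.doc_freq[t])) + 1.0)
            for t, c in tf.items()
        }

    def add(self, doc_id, text):
        """Add one document and return the index of its cluster."""
        vec = self._vectorize(text)
        best, best_sim = None, 0.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= self.threshold:
            # Fold the new document into the existing (legacy) cluster.
            for t, w in vec.items():
                self.centroids[best][t] = self.centroids[best].get(t, 0.0) + w
            self.members[best].append(doc_id)
            return best
        # No sufficiently similar cluster: start a new one.
        self.centroids.append(dict(vec))
        self.members.append([doc_id])
        return len(self.centroids) - 1


clusterer = IncrementalClusterer(threshold=0.3)
first = clusterer.add("d1", "stock market prices fall")
second = clusterer.add("d2", "stock market prices rise")
third = clusterer.add("d3", "football team wins match")
```

In this sketch the two finance documents share a cluster and the sports document opens a new one. Existing centroids are not reweighted as document frequencies change, which is a simplification; a fuller treatment would have to decide how legacy clusters absorb updated term weights.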

Keywords

Document Clustering Method; Incremental Clustering; Stop Word Extraction

