
Enhancing Document Clustering Method using Synonym of Cluster Topic and Similarity  

Park, Sun (Institute of Information Science and Engineering Research, Mokpo National University)
Kim, Kyung-Jun (Department of Computer Science, KAIST)
Lee, Jin-Seok (NIPA)
Lee, Seong-Ro (Department of Information and Electronics)
Abstract
This paper proposes an enhanced document clustering method that uses synonyms of cluster topic terms together with a similarity measure. The method selects the terms of each cluster topic from the semantic features obtained by non-negative matrix factorization (NMF), so the topics represent the inherent structure of the document set well. It mitigates the "bag of words" problem by expanding the cluster topic terms with their WordNet synonyms. It then improves clustering quality by using the cosine similarity between the expanded cluster topic terms and each document to assign the document to the appropriate cluster. Experimental results show that the proposed method outperforms other document clustering methods.
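The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy corpus, the `SYNONYMS` table (a hand-written stand-in for WordNet lookups), the choice of two top terms per topic, and all function names are hypothetical. NMF is computed with the Lee-Seung multiplicative updates in plain Python.

```python
import math
import random
from collections import Counter

def nmf(A, r, iters=300, seed=0):
    """Lee-Seung multiplicative updates: A (terms x docs) ~= W (terms x r) * H (r x docs)."""
    rng = random.Random(seed)
    m, n, eps = len(A), len(A[0]), 1e-9
    W = [[rng.random() + 0.1 for _ in range(r)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(r)]
    for _ in range(iters):
        # H <- H * (W^T A) / (W^T W H)
        WtA = [[sum(W[i][k] * A[i][j] for i in range(m)) for j in range(n)] for k in range(r)]
        WtW = [[sum(W[i][k] * W[i][l] for i in range(m)) for l in range(r)] for k in range(r)]
        H = [[H[k][j] * WtA[k][j] / (sum(WtW[k][l] * H[l][j] for l in range(r)) + eps)
              for j in range(n)] for k in range(r)]
        # W <- W * (A H^T) / (W H H^T)
        AHt = [[sum(A[i][j] * H[k][j] for j in range(n)) for k in range(r)] for i in range(m)]
        HHt = [[sum(H[k][j] * H[l][j] for j in range(n)) for l in range(r)] for k in range(r)]
        W = [[W[i][k] * AHt[i][k] / (sum(W[i][l] * HHt[l][k] for l in range(r)) + eps)
              for k in range(r)] for i in range(m)]
    return W, H

def topic_terms(W, vocab, top=2):
    """For each semantic feature (column of W), keep the `top` highest-weighted terms."""
    topics = []
    for k in range(len(W[0])):
        ranked = sorted(range(len(vocab)), key=lambda i: W[i][k], reverse=True)
        topics.append({vocab[i] for i in ranked[:top]})
    return topics

def expand(terms, synonyms):
    """Add every listed synonym of every topic term."""
    return set(terms) | {s for t in terms for s in synonyms.get(t, [])}

def cosine(doc_counts, topic):
    """Cosine between a term-frequency vector and a binary topic-term vector."""
    dot = sum(doc_counts[t] for t in topic)
    norm = math.sqrt(sum(c * c for c in doc_counts.values())) * math.sqrt(len(topic))
    return dot / norm if norm else 0.0

# Hypothetical toy corpus and synonym table (assumption: stands in for WordNet).
docs = [
    "car engine car", "car engine", "automobile motor car",
    "apple fruit apple", "apple fruit", "banana fruit",
]
SYNONYMS = {"car": ["automobile"], "engine": ["motor"]}

counts = [Counter(d.split()) for d in docs]
vocab = sorted({t for c in counts for t in c})
A = [[c[t] for c in counts] for t in vocab]  # term-document matrix

W, H = nmf(A, r=2)
topics = [expand(t, SYNONYMS) for t in topic_terms(W, vocab)]
# Assign each document to the cluster whose expanded topic terms it is most similar to.
labels = [max(range(len(topics)), key=lambda k: cosine(c, topics[k])) for c in counts]
print(topics, labels)
```

In this sketch the synonym expansion matters most for a document such as "automobile motor car": it shares only one word with a topic described by "car" and "engine", but matches every word once "automobile" and "motor" are added to the topic terms.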
Keywords
document clustering; non-negative matrix factorization (NMF); semantic features; synonym; cosine similarity
Citations & Related Records
Times Cited By KSCI: 3
1 G. Miller, "WordNet: A lexical database for English", CACM, vol. 38, no. 11, 1995, pp. 39-41.
2 The 20 newsgroups data set. http://people.csail.mit.edu/jrennie/20Newsgroups/, 2011.
3 W. Xu, X. Liu, Y. Gong, "Document Clustering Based On Non-negative Matrix Factorization", Proceedings of Special Interest Group on Information Retrieval (SIGIR), pp. 267-274, 2003.
4 S. Park, D. U. An, B. R. Cha, C. W. Kim, "Document Clustering with Cluster Refinement and Non-negative Matrix Factorization", In Proceedings of ICONIP'09, pp. 281-288, 2009.
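5 S. Park, C. W. Kim, "Document Clustering Using Non-negative Matrix Factorization and Cluster Cohesion", Journal of the Korea Institute of Maritime Information and Communication Sciences, vol. 13, no. 12, pp. 2603-2608, 2009 (in Korean).
6 S. Park, K. J. Kim, "Document Clustering Using Non-negative Matrix Factorization and Fuzzy Relationship", Journal of the Korea Navigation Institute, vol. 14, no. 2, pp. 239-246, 2010 (in Korean).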
7 S. Basu, A. Banerjee, R. Mooney, "Semi-supervised Clustering by Seeding", Proceedings of International Conference on Machine Learning (ICML), pp. 19-26, 2002.
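8 S. Park, D. U. An, "Document Clustering Method Using Principal Component Analysis and Fuzzy Association", Journal of the Korea Information Processing Society, vol. 17-B, no. 2, pp. 177-182, 2010 (in Korean).
9 K. H. Han, K. W. Nam, "Introduction to Korean Information Processing: For Computers to Understand Korean", CommunicationBooks, 2007 (in Korean).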
10 W. B. Frakes, R. Baeza-Yates, "Information Retrieval: Data Structures & Algorithms", Prentice-Hall, 1992.
11 R. Baeza-Yates, B. Ribeiro-Neto, "Modern Information Retrieval", ACM Press, 1999.
12 X. Hu, X. Zhang, C. Lu, E. K. Park, X. Zhou, "Exploiting Wikipedia as External Knowledge for Document Clustering", In Proceedings of the 15th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'09), Paris, France, Jun. 2009, pp. 389-396.
13 S. Chakrabarti, "Mining the Web: Discovering Knowledge from Hypertext Data", Morgan Kaufmann Publishers, 2003.
14 J. Han, M. Kamber, "Data Mining: Concepts and Techniques", Second Edition, Morgan Kaufmann, 2006.
15 D. D. Lee, H. S. Seung, "Learning the parts of objects by non-negative matrix factorization", Nature, vol. 401, pp. 788-791, Oct. 1999.
16 T. Li, S. Ma, M. Ogihara, "Document Clustering via Adaptive Subspace Iteration", In Proceedings of SIGIR'04, pp. 218-225, 2004.
17 F. Wang, C. Zhang, "Regularized Clustering for Documents", In Proceedings of ACM SIGIR'07, pp. 95-102, 2007.