http://dx.doi.org/10.6109/jkiice.2014.18.3.625

Document Clustering Technique by K-means Algorithm and PCA  

Kim, Woosaeng (Department of Computer Software, Kwangwoon University)
Kim, Sooyoung (Department of Computer Engineering, Handong Global University)
Abstract
The amount of information is increasing rapidly with the development of the internet and computers. Because this enormous volume of information is managed in the form of documents, it must be searched and processed efficiently. Document clustering, which groups related documents according to the similarity between them, helps to classify, search, and process large numbers of documents automatically. This paper proposes a method that uses principal component analysis (PCA) to find the initial seed points when documents, represented as vectors in a feature vector space, are clustered by the K-means algorithm, in order to improve clustering performance. The experiments show that the proposed method performs better than the traditional K-means algorithm.
Keywords
Document Clustering; K-means algorithm; PCA
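The abstract does not say how the seed points are derived from the principal components, so the following is only a minimal sketch of one plausible variant, not the authors' procedure. It assumes that documents are projected onto the first principal component, sorted by that score, split into k equal-sized groups, and that each group's centroid is used as an initial K-means seed. The helper names (`pca_seeds`, `first_principal_component`, `kmeans`) and the toy term-frequency vectors are hypothetical and chosen for illustration.

```python
import math
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def first_principal_component(rows, iters=200, seed=0):
    """Approximate the leading principal direction by power iteration.

    Returns the unit direction and the mean-centered data.
    """
    mu = mean_vector(rows)
    centered = [[x - m for x, m in zip(r, mu)] for r in rows]
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in mu]
    for _ in range(iters):
        # w = X^T (X v), which is proportional to (covariance matrix) * v
        scores = [sum(a * b for a, b in zip(r, v)) for r in centered]
        w = [sum(s * r[j] for s, r in zip(scores, centered))
             for j in range(len(mu))]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v, centered


def pca_seeds(rows, k):
    """ASSUMED seeding rule: sort documents by their score on the first
    principal component, cut into k groups, return each group's centroid."""
    v, centered = first_principal_component(rows)
    scores = [sum(a * b for a, b in zip(r, v)) for r in centered]
    order = sorted(range(len(rows)), key=lambda i: scores[i])
    seeds = []
    for g in range(k):
        lo = g * len(rows) // k
        hi = (g + 1) * len(rows) // k
        seeds.append(mean_vector([rows[i] for i in order[lo:hi]]))
    return seeds


def kmeans(rows, seeds, max_iter=100):
    """Plain Lloyd-style K-means with squared Euclidean distance,
    started from the given seed points."""
    centers = [list(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)),
                key=lambda c: sum((x - y) ** 2
                                  for x, y in zip(r, centers[c])))
            for r in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(len(centers)):
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:
                centers[c] = mean_vector(members)
    return labels, centers


# Toy term-frequency vectors (hypothetical): documents 0-2 favor the
# first two terms, documents 3-5 favor the last two.
docs = [
    [3, 2, 0, 0],
    [2, 3, 0, 1],
    [3, 3, 1, 0],
    [0, 0, 3, 2],
    [0, 1, 2, 3],
    [1, 0, 3, 3],
]

seeds = pca_seeds(docs, 2)
labels, centers = kmeans(docs, seeds)
print(labels)
```

Because the seeds come from the data's dominant direction of variance rather than from random draws, the result is deterministic for a given input. Which cluster receives label 0 depends on the sign of the principal direction, which is arbitrary.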