Document Clustering Technique by K-means Algorithm and PCA

Kim, Woosaeng; Kim, Sooyoung

doi:10.6109/jkiice.2014.18.3.625

Journal of the Korea Institute of Information and Communication Engineering (한국정보통신학회논문지)

Volume 18, Issue 3 / Pages 625-630 / 2014 / 2234-4772 (pISSN) / 2288-4165 (eISSN)

The Korea Institute of Information and Communication Engineering (한국정보통신학회)

Document Clustering Method Using Principal Component Analysis and the K-means Algorithm (주성분 분석과 k 평균 알고리즘을 이용한 문서군집 방법)

Kim, Woosaeng (Department of Computer Software, Kwangwoon University) ;
Kim, Sooyoung (Department of Computer Engineering, Handong Global University)

김우생 ;
김수영

Received : 2013.09.03
Accepted : 2013.10.31
Published : 2014.03.31

Abstract

The amount of information is growing rapidly with advances in computers and the internet. Because most of this information is managed in the form of documents, efficient methods for searching and processing them are needed. Document clustering groups related documents according to the similarity between them, which helps to classify, search, and process large document collections automatically and improves the efficiency and accuracy of these tasks. This paper proposes a method that uses principal component analysis (PCA) to select the initial seed points when documents, represented as vectors in a feature vector space, are clustered with the K-means algorithm, in order to improve clustering performance. Experiments show that the proposed method performs better than the traditional K-means algorithm.
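The abstract does not say how the seed points are derived from PCA, so the following is only a minimal sketch under a stated assumption: documents are projected onto the first principal component, the sorted projections are split into k equal-sized groups, and each group's mean vector is used as an initial seed for K-means. This seeding rule, the function names (`first_principal_direction`, `pca_seeds`, `kmeans`), and the toy term-frequency vectors are all hypothetical illustrations, not the authors' published procedure. Plain Python is used to keep the sketch dependency-free.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def first_principal_direction(rows, iterations=100):
    """Approximate the first principal component by power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(r, mean)] for r in rows]
    # Heuristic start vector: the centered row with the largest norm.
    v = max(centered, key=lambda r: sum(x * x for x in r))
    for _ in range(iterations):
        # Apply X^T X to v without building the covariance matrix.
        scores = [sum(x * y for x, y in zip(r, v)) for r in centered]
        w = [sum(s * r[j] for s, r in zip(scores, centered)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # all documents identical: no direction to find
            break
        v = [x / norm for x in w]
    return mean, v

def pca_seeds(rows, k):
    """ASSUMED seeding rule: split documents into k groups along the
    first principal component and return each group's mean vector.
    Requires len(rows) >= k so that no group is empty."""
    mean, v = first_principal_direction(rows)
    projection = lambda r: sum((x - m) * y for x, m, y in zip(r, mean, v))
    order = sorted(range(len(rows)), key=lambda i: projection(rows[i]))
    n = len(order)
    seeds = []
    for g in range(k):
        group = [rows[i] for i in order[g * n // k:(g + 1) * n // k]]
        seeds.append([sum(col) / len(group) for col in zip(*group)])
    return seeds

def kmeans(rows, seeds, iterations=50):
    """Standard Lloyd iterations starting from the given seed points."""
    centroids = [list(s) for s in seeds]
    labels = []
    for _ in range(iterations):
        new_labels = [min(range(len(centroids)),
                          key=lambda c: sq_dist(r, centroids[c]))
                      for r in rows]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy term-frequency vectors (made up): six documents over four terms.
docs = [
    [3, 2, 0, 0], [2, 3, 0, 1], [3, 3, 1, 0],
    [0, 0, 3, 2], [0, 1, 2, 3], [1, 0, 3, 3],
]
labels, centroids = kmeans(docs, pca_seeds(docs, 2))
print(labels)
```

Because the sign of a principal component is arbitrary, the cluster numbering may come out either way; only the grouping is meaningful. A deterministic seeding of this kind removes the run-to-run variation of random initialization, which is one plausible reason such a method could outperform traditional K-means.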
