Comparison of Document Clustering algorithm using Genetic Algorithms by Individual Structures

Choi, Lim-Cheon; Song, Wei; Park, Soon-Cheol

doi:10.9723/jksiis.2011.16.3.047

한국산업정보학회논문지 (Journal of Korea Society of Industrial Information Systems)

Vol. 16, No. 3 / pp. 47-56 / 2011 / 1229-3741 (pISSN)

한국산업정보학회 (Korea Society of Industrial Information Systems)


개체 구조에 따른 유전자 알고리즘 기반의 문서 클러스터링 성능 비교


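Choi, Lim-Cheon (Department of Computer Engineering, Chonbuk National University) ;
Song, Wei (School of Information Technology, Jiangnan University) ;
Park, Soon-Cheol (Division of Electronics and Information Engineering, Chonbuk National University)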

Received : 2011.05.25
Reviewed : 2011.07.05
Published : 2011.09.30


Abstract

Applying a genetic algorithm to document clustering requires an appropriate individual (chromosome) structure. The existing approach, document clustering with genetic algorithms (DCGA), uses a centroid-vector individual structure. The new approach, new document clustering with the genetic algorithm (NDCGA), uses a document-allocation individual structure. To determine which individual structure and which operations are better suited to document clustering, this paper compares and analyzes the two structures in terms of their operations, amount of computation, clustering run time, and clustering performance. Across the experiments, NDCGA ran about 15% faster than DCGA and achieved about 5~10% better performance, which shows that the document-allocation structure is better suited to document clustering than the centroid-vector structure. In addition, NDCGA performed 15~25% better than the traditional clustering algorithms K-means and Group Average (the Korean-language abstract gives this figure as 15~20%).
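The two individual structures named in the abstract can be illustrated with a minimal sketch. The abstract does not specify the fitness function, the genetic operators, or the document representation, so everything below is an assumption made for illustration: documents are small term-weight vectors, similarity is cosine, and all function names are hypothetical. The sketch only shows what each kind of individual stores and how it maps to a clustering; it is not the authors' implementation.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def decode_centroid_individual(centroids, docs):
    """Centroid-vector structure (assumed form): the individual stores K centroid
    vectors, and each document is assigned to its most similar centroid."""
    return [
        max(range(len(centroids)), key=lambda k: cosine(doc, centroids[k]))
        for doc in docs
    ]


def centroids_from_allocation(labels, docs, k):
    """Document-allocation structure (assumed form): the individual stores one
    cluster label per document; centroids are derived as per-cluster means."""
    dim = len(docs[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for label, doc in zip(labels, docs):
        counts[label] += 1
        for j, x in enumerate(doc):
            sums[label][j] += x
    # An empty cluster keeps an all-zero centroid.
    return [
        [s / counts[c] for s in sums[c]] if counts[c] else sums[c]
        for c in range(k)
    ]


def random_allocation_individual(n_docs, k, rng):
    """Random initial individual for the document-allocation structure."""
    return [rng.randrange(k) for _ in range(n_docs)]


# Toy data: four documents over a three-term vocabulary (made up for illustration).
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.2], [0.0, 0.8, 0.3]]

centroid_individual = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # K x (number of terms) real genes
allocation_individual = [0, 0, 1, 1]                       # one integer gene per document

labels_from_centroids = decode_centroid_individual(centroid_individual, docs)
derived_centroids = centroids_from_allocation(allocation_individual, docs, k=2)
print(labels_from_centroids)
print(derived_centroids)
```

Under these assumptions, a centroid-vector individual carries K times the vocabulary size in real-valued genes, while a document-allocation individual carries one integer gene per document, which is one way the two structures can differ in the amount of computation per operation.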



