Topic based Web Document Clustering using Named Entities

Sung, Ki-Youn; Yun, Bo-Hyun

doi:10.5392/JKCA.2010.10.5.029

The Journal of the Korea Contents Association (한국콘텐츠학회논문지)

Volume 10, Issue 5, Pages 29-36, 2010, pISSN 1598-4877, eISSN 2508-6723

The Korea Contents Association (한국콘텐츠학회)

개체명을 이용한 주제기반 웹 문서 클러스터링

Sung, Ki-Youn (INITECH Co., Ltd.);
Yun, Bo-Hyun (Mokwon University)

Received : 2010.05.03
Accepted : 2010.05.11
Published : 2010.05.28

https://doi.org/10.5392/JKCA.2010.10.5.029

Abstract

Previous clustering research has focused on extracting keywords and grouping documents by word similarity. However, the large number of candidate keywords that must be compared leads to high computational complexity, low speed, and low accuracy. To overcome these weaknesses, this paper proposes a topic-based web document clustering model that uses not only keywords but also named entities such as person names, organizations, and locations. Through several experiments, we show that the model improves on a traditional keyword-only model, and we analyze how the results vary with the characteristics of the document collection.

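Conventional clustering techniques simply group documents by inter-word similarity computed from extracted keywords. Because the number and variety of keywords to be compared is very large, the amount of computation grows, so these techniques are slow and not very accurate. To address these shortcomings, this paper proposes a clustering technique for web documents that uses not only the conventional noun-centered keywords but also the output of named entity recognition, which automatically identifies person names, place names, company names, product names, and similar items. Experiments demonstrate that named-entity-based clustering yields better quality than conventional keyword-based clustering, and clustering results are also compared and analyzed according to the characteristics of the document set.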
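
The idea of combining keyword and named-entity evidence can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the weighting, or the clustering algorithm, so everything here is an assumption made for illustration: the function names, the `entity_weight` and `threshold` values, the single-pass clustering strategy, and the hand-labeled toy documents (a real system would obtain `entities` from a named entity recognizer).

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def similarity(doc_a, doc_b, entity_weight=0.7):
    """Weighted mix of named-entity and keyword similarity.

    entity_weight is a hypothetical parameter, not a value from the paper.
    """
    kw = cosine(Counter(doc_a["keywords"]), Counter(doc_b["keywords"]))
    ne = cosine(Counter(doc_a["entities"]), Counter(doc_b["entities"]))
    return entity_weight * ne + (1.0 - entity_weight) * kw


def cluster(docs, threshold=0.5):
    """Single-pass clustering: a document joins the first cluster whose
    seed (first member) is similar enough, otherwise it starts a new one."""
    clusters = []
    for i, doc in enumerate(docs):
        for members in clusters:
            if similarity(docs[members[0]], doc) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Toy documents with hand-labeled keywords and entities (illustrative only).
docs = [
    {"keywords": ["election", "vote", "campaign"],
     "entities": ["Seoul", "National Assembly"]},
    {"keywords": ["ballot", "vote", "poll"],
     "entities": ["Seoul", "National Assembly"]},
    {"keywords": ["match", "goal", "vote"],
     "entities": ["FIFA", "Johannesburg"]},
]

print(cluster(docs))  # [[0, 1], [2]]
```

In this toy data all three documents share the keyword "vote", so keyword similarity alone is the same (1/3) for every pair and cannot separate them. The shared entities pull the first two documents together, which is the kind of effect the abstract attributes to named entities.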
