A Search-Result Clustering Method based on Word Clustering for Effective Browsing of the Paper Retrieval Results

  • Received : 2009.04.17
  • Accepted : 2009.12.11
  • Published : 2010.03.15

Abstract

Search-result clustering is the automatic, online grouping of similar documents among the results returned by a search engine. In this paper, we propose a new search-result clustering algorithm specialized for a paper search service. The proposed system consists of two components: the Category Hierarchy Generation System (CHGS) and the Paper Clustering System (PCS). CHGS first builds a category hierarchy, called the Field Thesaurus, for each research field from an existing research category hierarchy (KOSEF's research category hierarchy). It then expands the keyword set of the Field Thesaurus with a word clustering method based on the K-means algorithm. PCS then determines the category of each paper using top-down and bottom-up methods. The proposed system can be applied to retrieval services for specialized fields, such as a paper search service.
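The keyword-expansion step in CHGS can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how words are represented or how clusters are mapped back to categories. The sketch assumes each word has a small numeric vector (the hypothetical `word_vectors`), and that a category absorbs every word that lands in the same K-means cluster as one of its seed keywords. The names `kmeans`, `expand_keywords`, and the sample thesaurus are all illustrative.

```python
from math import dist


def kmeans(points, k, iters=20):
    """Plain Lloyd's K-means. The first k points seed the centroids,
    which keeps this toy example deterministic (an assumption, not
    something the paper specifies)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels


def expand_keywords(thesaurus, word_vectors, k):
    """Add to each category every word that shares a cluster with
    one of the category's seed keywords (assumed expansion rule)."""
    words = list(word_vectors)
    labels = kmeans([word_vectors[w] for w in words], k)
    cluster_of = dict(zip(words, labels))
    expanded = {}
    for category, seeds in thesaurus.items():
        seed_clusters = {cluster_of[s] for s in seeds if s in cluster_of}
        expanded[category] = set(seeds) | {
            w for w in words if cluster_of[w] in seed_clusters
        }
    return expanded


# Hypothetical two-dimensional word vectors and seed thesaurus.
word_vectors = {
    "database": (1.0, 0.0), "network": (0.0, 1.0),
    "query": (0.9, 0.1), "index": (0.8, 0.2),
    "router": (0.1, 0.9), "packet": (0.2, 0.8),
}
thesaurus = {"Databases": {"database"}, "Networking": {"network"}}

expanded = expand_keywords(thesaurus, word_vectors, k=2)
print(sorted(expanded["Databases"]))   # ['database', 'index', 'query']
print(sorted(expanded["Networking"]))  # ['network', 'packet', 'router']
```

In practice the vectors would come from term statistics over a paper corpus, and the number of clusters would need tuning per research field. Both choices are left open here because the abstract does not fix them.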
