
An Efficient Method for Detecting Duplicated Documents in a Blog Service System  

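Lee, Sang-Chul (Electronics, Computer and Communication Engineering, Hanyang University)
Lee, Soon-Haeng (Electronics, Computer and Communication Engineering, Hanyang University)
Kim, Sang-Wook (Electronics, Computer and Communication Engineering, Hanyang University)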
Abstract
Duplicate documents in a blog service system are one of the causes that degrade both the quality and the performance of blog search. Unlike in the WWW environment, every document creation is reported to the blog service system, which makes it possible to distinguish an original document from its duplicates. Based on this observation, this paper proposes a novel method for detecting duplicate documents in a blog service system. The method determines whether a document is an original at the time it is stored in the blog service system. As a result, it solves the problem of duplicate documents appearing in search results by keeping those documents out of the index of the blog search engine. This paper also proposes three indexing methods that preserve the accuracy of a previous approach, Min-hashing. Extensive experiments on real-life blog data show which of the indexing methods is the most effective.
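The storage-time check described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not describe the three proposed indexing methods, so a plain linear scan over stored signatures stands in for them. The names `BlogStore`, `minhash_signature`, and `estimated_similarity`, the word-shingle size, the signature length, and the 0.8 threshold are all assumptions made for illustration.

```python
import hashlib

NUM_HASHES = 64    # signature length (assumed value)
SHINGLE_SIZE = 3   # words per shingle (assumed value)
THRESHOLD = 0.8    # similarity at or above this counts as a duplicate (assumed)


def shingles(text, k=SHINGLE_SIZE):
    """Return the set of k-word shingles of a document."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, shingle):
    """One member of a family of 64-bit hash functions, selected by seed."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_hashes=NUM_HASHES):
    """Min-hash signature: the minimum hash value under each hash function."""
    doc_shingles = shingles(text)
    return tuple(
        min(_hash(seed, s) for s in doc_shingles) for seed in range(num_hashes)
    )


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


class BlogStore:
    """Hypothetical store that keeps duplicates out of the search index."""

    def __init__(self, threshold=THRESHOLD):
        self.threshold = threshold
        self.search_index = {}  # doc_id -> signature, originals only
        self.duplicates = {}    # duplicate doc_id -> original doc_id

    def store(self, doc_id, text):
        """Return True if the document is indexed as an original."""
        sig = minhash_signature(text)
        # Linear scan: a stand-in for the paper's indexing methods.
        for orig_id, orig_sig in self.search_index.items():
            if estimated_similarity(sig, orig_sig) >= self.threshold:
                self.duplicates[doc_id] = orig_id
                return False
        self.search_index[doc_id] = sig
        return True
```

Because documents arrive in creation order, the first copy seen is treated as the original and later near-copies are recorded against it instead of being indexed. The linear scan costs one signature comparison per stored original, which is the cost an index over signatures would be meant to reduce.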
Keywords
Duplicate document detection; Blog; Search engine;
  • Reference
1 M. Henzinger, "Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms," In Proc. ACM Int'l. Conf. on Information Retrieval, SIGIR, pp.284-291, 2006.
2 A. Broder et al., "Min-Wise Independent Permutations," Journal of Computer and System Sciences, vol.60, no.3, pp.630-659, 2000.
3 A. Broder, "Identifying and Filtering Near-Duplicate Documents," In Proc. Int'l. Symp. on Combinatorial Pattern Matching, CPM, pp.1-10, 2000.
4 A. Broder, "Identifying and Filtering Near-Duplicate Documents," In Proc. Int'l. Symp. on Combinatorial Pattern Matching, CPM, pp.1-10, 2000.
5 N. Beckmann et al., "The R*-tree: An Efficient and Robust Access Method for Points and Rectangles," In Proc. ACM Int'l. Conf. on Management of Data, SIGMOD, pp.322-331, 1990.
6 M. Rabin, "Fingerprinting by Random Polynomials," Technical Report TR-CSE-03-01, Harvard University, 1981.
7 A. Broder et al., "Syntactic Clustering of the Web," In Proc. Int'l. World Wide Web Conference, WWW, pp.391-404, 1997.
8 SK Communications, http://www.egloos.com.
9 Jong Wook Kim, K. Selcuk Candan, and Junichi Tatemura, "Efficient Overlap and Content Reuse Detection in Blogs and Online News Articles," In Proc. Int'l. World Wide Web Conference, WWW, pp.81-90, 2009.