Scaling Reuse Detection in the Web through Two-way Boosting with Signatures and LSH

Kim, Jong Wook

doi:10.9717/kmms.2013.16.6.735

Journal of Korea Multimedia Society (한국멀티미디어학회논문지)

Volume 16, Issue 6, Pages 735-745, 2013
pISSN 1229-7771, eISSN 2384-0102

Korea Multimedia Society (한국멀티미디어학회)

Kim, Jong Wook (Teradata Labs)

Received : 2012.04.24
Accepted : 2013.05.04
Published : 2013.06.30

https://doi.org/10.9717/kmms.2013.16.6.735

Abstract

The emergence of Web 2.0 technologies, such as blogs and wikis, enables even naive users to easily create and share content on the Web using freely available content-sharing tools. The wide availability of almost free data and the promiscuous sharing of content through social networking platforms have created a content-borrowing phenomenon, in which the same content appears (in many cases as extensive quotations) in different outlets. An immediate side effect of this phenomenon is that identifying which content is reused by whom is becoming a critical tool in social network analysis, including expert identification and the analysis of information flow. Internet-scale reuse detection, however, poses extremely challenging scalability issues: given the large volume of user-created data on the Web, techniques for content-reuse detection must be fast and scalable. In this paper, we therefore propose $qSign_{lsh}$, a mechanism for identifying multi-sentence content reuse among documents by efficiently combining sentence-level evidence. The experimental results show that $qSign_{lsh}$ significantly improves reuse detection speed while providing high recall.
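The abstract does not spell out how $qSign_{lsh}$ works, so the following is only a generic illustration of the two ingredients it names: per-sentence signatures and locality-sensitive hashing (LSH), with sentence-level matches then combined per document pair. It is a minimal sketch using textbook MinHash with banding, not the paper's algorithm. All names and parameters (`NUM_HASHES`, `BANDS`, `word_qgrams`, `reuse_scores`, the word 2-gram choice) are assumptions made for the example.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 24  # signature length (assumed value)
BANDS = 12       # number of LSH bands; rows per band = NUM_HASHES // BANDS


def word_qgrams(sentence, q=2):
    """Return the set of word q-grams of a sentence (lowercased)."""
    words = re.findall(r"\w+", sentence.lower())
    if len(words) < q:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + q]) for i in range(len(words) - q + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(grams):
    """MinHash signature: the minimum hash value per seed."""
    return tuple(min(_hash(seed, g) for g in grams) for seed in range(NUM_HASHES))


def candidate_sentence_pairs(docs):
    """docs maps doc_id -> list of sentences.

    Sentences whose signatures agree on at least one band land in the same
    bucket and become candidate pairs; only cross-document pairs are kept.
    """
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(set)
    for doc_id, sentences in docs.items():
        for idx, sentence in enumerate(sentences):
            grams = word_qgrams(sentence)
            if not grams:
                continue
            sig = minhash_signature(grams)
            for band in range(BANDS):
                key = (band, sig[band * rows:(band + 1) * rows])
                buckets[key].add((doc_id, idx))
    pairs = set()
    for members in buckets.values():
        for a, b in combinations(sorted(members), 2):
            if a[0] != b[0]:
                pairs.add((a, b))
    return pairs


def reuse_scores(docs):
    """Combine sentence-level evidence: count candidate matches per document pair."""
    scores = defaultdict(int)
    for a, b in candidate_sentence_pairs(docs):
        scores[(a[0], b[0])] += 1
    return dict(scores)
```

In this sketch, a document pair with several matching sentences accumulates a higher score than a pair sharing a single sentence, which is one simple way to read "multi-sentence reuse". Fewer rows per band raise recall at the cost of more false candidates, so a verification step on the candidate pairs would normally follow.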
