References
- K.H. Hyun, "Video Matching Algorithm of Content-Based Video Copy Detection for Copyright Protection," Journal of Korea Multimedia Society, Vol. 11, No. 3, pp. 315-322, 2008.
- D. Metzler, Y. Bernstein, W.B. Croft, A. Moffat, and J. Zobel, "Similarity Measures for Tracking Information Flow," Proc. the Conference on Information and Knowledge Management, pp. 517-524, 2005.
- X. Chen, B. Francia, M. Li, and B. Mckinnon, "Shared Information and Program Plagiarism Detection," IEEE Transactions on Information Theory, Vol. 50, No. 7, pp. 1545-1551, 2004. https://doi.org/10.1109/TIT.2004.830793
- N. Shivakumar and H. Garcia-Molina, "SCAM: A Copy Detection Mechanism for Digital Documents," Second Annual Conference on the Theory and Practice of Digital Libraries, 1995.
- Y. Bernstein and J. Zobel, "A Scalable System for Identifying Co-derivative Documents." Proc. String Processing and Information Retrieval Symp, pp. 56-67, 2004.
- S. Schleimer, D.S. Wilkerson, and A. Aiken, "Winnowing: Local Algorithms for Document Fingerprinting," Proc. the ACM SIGMOD International Conference, pp. 76-85, 2003.
- R.J. Bayardo, Y. Ma, and R. Srikant, "Scaling up All Pairs Similarity Search," Proc. International World Wide Web Conference, pp. 131-140, 2007.
- J. Lin, "Brute Force and Indexed Approaches to Pairwise Document Similarity Comparisons with MapReduce," Proc. the international ACM SIGIR conference , pp. 155-161, 2009.
- T. Elsayed, J. Lin, and D. Oard, "Pairwise Document Similarity in Large Collections with MapReducee," Proc. Annual Meeting of the Association of Computational Linguistics, pp. 265-268, 2008.
- M. Theobald, J. Siddharth, and A. Paepcke, "SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections," Proc. the International ACM SIGIR Conference, pp. 563-570, 2008.
- C. Xiao, W. Wang, X. Lin, and J.X. Yu, "Efficient Similarity Joins for Near Duplicate Detection," Proc. International World Wide Web Conference, pp. 131-140, 2008.
- J.W. Kim, K.S. Candan, and J. Tatemura, "Efficient Overlap and Content Reuse Detection in Blogs and Online News Articles," Proc. International World Wide Web Conference, pp. 81-90, 2009.
- P. Indyk, and R. Motwani, "Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality," ACM Symposium on the Theory of Computing, pp. 604-613, 1998.
- A. Gionis, P. Indyk, and R. Motwani, "Similarity Search in High Dimensions via Hashing," Proc. the International Conference on Very Large Data Bases, pp. 518-529, 1999.
- N. Beckmann, H.P. Kriegel, R. Schneider, and B. Seeger, "The R*-tree: An Efficient and Robust Access Method for Points and Rectangles," Proc. the ACM SIGMOD International Conference, pp. 322-331, 1990.
- J.T. Robinson, "The K-D-B-tree: A Search Structure for Large Multidimensional Dynamic Indexes," Proc. the ACM SIGMOD International Conference, pp. 10-18, 1981.
- S. Berchtold, C. Bohm, and H.P. Kriegel, "The Pyramid-Technique: Towards Breaking the Curse of Dimensionality," Proc. the ACM SIGMOD International Conference, pp. 142-153, 1998.
- K. Chakrabarti and S. Mehrotra, "The Hybrid Tree: An Index Structure for High Dimensional Feature Spaces," Proc. the International Conference on Data Engineering, pp. 440-447, 1999.
- A. Andoni and P. Indyk, "Near-optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions," Communications of the ACM, Vol. 51, No. 1, pp. 117-122, 2008.
- J. Zobel, A. Moffat, and K. Ramamohanarao, "Inverted Files versus Signature Files for Text Indexing," ACM Transactions on Database Systems, Vol. 23, No. 4, pp. 453-490, 1998. https://doi.org/10.1145/296854.277632
- A. Arasu, V. Ganti, and R. Kaushik, "Efficient Exact Set-similarity Joins," Proc. the International Conference on Very Large Data Bases, pp. 918-929, 2006.
- Google Blog Search. http://blogsearch.google.com/blogsearch, 2013.
- Google News. http://news.google.com, 2013.
- Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2006.
- C. Li, J. Lu, and Y. Lu, "Efficient Merging and Filtering Algorithms for Approximate String Searches," Proc. the International Conference on Data Engineering, pp. 257-266, 2008.
- A. Chowdhury, O. Frieder, D. Grossman, and M.C. McCabe, "Collection Statistics for Fast Duplicate Document Detection," ACM Transactions on Information Systems, Vol. 20, No. 2, pp. 171-191, 2002. https://doi.org/10.1145/506309.506311
- N. Shrivakumar and H. Garcia-Molina, "Finding Near-replicas of Documents on the Web," International Workshop on the World Wide Web and Databases, pp. 204-212, 1998.
- H. Yang and J. Callan, "Near-duplicate Detection by Instance-level Constrained Clustering," Proc. the international ACM SIGIR conference, pp. 421-428, 2006.
- L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N.Koudas, S. Muthukrishnan, and D. Srivastava, "Approximate String Joins in a Database (almost) for Free," Proc. the International Conference on Very Large Data Bases, pp. 491-500, 2001.
- S. Chaudhuri, V. Ganti, and R. Kaushik, "A Primitive Operator for Similarity Joins in Data Cleaning," Proc. the International Conference on Data Engineering, pp. 5-16, 2006.
- S. Sarawagi, and A. Kirpa, "Efficient Set Joins on Similarity Predicates," Proc. the ACM SIGMOD International Conference, pp. 743-754, 2004.
- L. Huang, L. Wang, and X. Li, "Achieving Both High Precision and High Recall in Near-duplicate Detection," Proc. the Conference on Information and Knowledge Management, pp. 63-72, 2008.
- J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," USENIX Symposium on Operating Systems Design and implementation, pp. 137-150, 2004.
- Yahoo!, "Hadoop". http://hadoop.apache.org, 2013.