A Sampling-based Algorithm for Top-${\kappa}$ Similarity Joins  

Park, Jong Soo (School of IT, Sungshin Women's University)
Abstract
The top-${\kappa}$ set similarity join problem asks for the ${\kappa}$ pairs of records with the highest similarity between two sets of input records. We propose an efficient algorithm that returns the top-${\kappa}$ similarity join pairs using a sampling technique. From a sample of the input records, we construct a histogram of set similarity joins and then use statistical inference to compute from the histogram an estimated similarity threshold for the top-${\kappa}$ join pairs, within an error bound at the 95% confidence level. Finally, the estimated threshold is applied to a traditional similarity join algorithm, which uses a min-heap structure to obtain the top-${\kappa}$ similarity joins. Experimental results on large real datasets show that the proposed algorithm performs well.
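The three steps described above (sample, estimate a threshold from a histogram, then join with a min-heap) can be illustrated with a minimal Python sketch. It is not the paper's implementation. The abstract does not give the estimator, so the Jaccard similarity, the uniform sampling of record pairs, the number of bins, the normal-approximation bound (z = 1.96) and all function names below are assumptions made for illustration. The nested-loop join stands in for the "traditional similarity join algorithm", which in practice would use an index-based method.

```python
import heapq
import math
import random
from itertools import product


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def estimate_threshold(R, S, k, sample_size, bins=20, z=1.96, seed=0):
    """Estimate a similarity threshold for the top-k pairs (assumed procedure).

    Builds a histogram of similarities over randomly sampled record pairs and
    returns the lower edge of the highest bin at which the 95% lower confidence
    bound on the number of pairs in R x S at or above that edge still reaches k.
    """
    rng = random.Random(seed)
    counts = [0] * bins
    for _ in range(sample_size):
        sim = jaccard(rng.choice(R), rng.choice(S))
        counts[min(int(sim * bins), bins - 1)] += 1

    total_pairs = len(R) * len(S)
    cumulative = 0
    for b in range(bins - 1, -1, -1):  # scan from the most similar bin down
        cumulative += counts[b]
        p = cumulative / sample_size
        # Normal-approximation lower bound on the proportion (z = 1.96 for 95%).
        p_low = p - z * math.sqrt(p * (1 - p) / sample_size)
        if p_low * total_pairs >= k:
            return b / bins
    return 0.0  # no bin is safe: fall back to no pruning


def top_k_join(R, S, k, threshold=0.0):
    """Nested-loop join keeping the k most similar pairs in a min-heap."""
    heap = []  # entries are (similarity, index in R, index in S)
    for (i, r), (j, s) in product(enumerate(R), enumerate(S)):
        sim = jaccard(r, s)
        if sim < threshold:
            continue  # pruned by the estimated threshold
        if len(heap) < k:
            heapq.heappush(heap, (sim, i, j))
        elif sim > heap[0][0]:
            heapq.heapreplace(heap, (sim, i, j))
    return sorted(heap, reverse=True)


def sampled_top_k_join(R, S, k, sample_size):
    """Join with the estimated threshold; rerun unpruned if it was too high."""
    threshold = estimate_threshold(R, S, k, sample_size)
    result = top_k_join(R, S, k, threshold)
    if len(result) < k:
        result = top_k_join(R, S, k, 0.0)
    return result
```

In this sketch the threshold only prunes candidates. If the estimate turns out too high and fewer than ${\kappa}$ pairs survive, the join is repeated without pruning, so the returned similarities match those of an exact join.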
Keywords
top-${\kappa}$ similarity join; sampling; 95% confidence level; performance evaluation