[KSCI] Korea Science Citation Index Service

A Sampling-based Algorithm for Top-$\kappa$ Similarity Joins

Park, Jong Soo (School of IT, Sungshin Women's University)

Publication Information

Journal of KIISE:Databases / v.41, no.4, 2014, pp. 256-261

Abstract

The top-$\kappa$ set similarity join problem is to find, between two sets of input records, the $\kappa$ pairs of records with the highest similarity. We propose an efficient algorithm that returns the top-$\kappa$ similarity join pairs using a sampling technique. From a sample of the input records, we construct a histogram of set similarity joins, and from that histogram we compute an estimated similarity threshold for the top-$\kappa$ join pairs, within the error bound of a 95% confidence level based on statistical inference. Finally, the estimated threshold is applied to a traditional similarity join algorithm, which uses a min-heap structure to obtain the top-$\kappa$ similarity joins. Experimental results on large real datasets show that the proposed algorithm performs well.
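
The pipeline described in the abstract (sample, histogram, estimated threshold, min-heap join) can be illustrated with a minimal Python sketch. This is not the paper's implementation. The function names, the use of Jaccard similarity, the equal-width histogram bins, the normal-approximation lower bound with z = 1.96 (which treats sampled pairs as independent), and the fallback to an unfiltered join are all assumptions made for illustration. The brute-force join stands in for the traditional similarity join algorithm, which would normally use the threshold to prune candidates rather than compute every pair.

```python
import heapq
import math
import random
from itertools import product


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed similarity measure)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def estimate_threshold(r, s, k, sample_size, bins=20, z=1.96, seed=0):
    """Estimate a similarity threshold for the top-k pairs from a sample.

    Hypothetical procedure: build a histogram of similarities over the
    sampled pairs, then pick the highest bin boundary for which a
    conservative (lower confidence bound) estimate of the number of
    pairs at or above it in the full join is still at least k.
    """
    rng = random.Random(seed)
    rs = rng.sample(r, min(sample_size, len(r)))
    ss = rng.sample(s, min(sample_size, len(s)))

    hist = [0] * bins
    for a, b in product(rs, ss):
        hist[min(int(jaccard(a, b) * bins), bins - 1)] += 1

    m = len(rs) * len(ss)      # number of sampled pairs
    total = len(r) * len(s)    # number of pairs in the full join
    above = 0
    for i in range(bins - 1, -1, -1):
        above += hist[i]
        p = above / m
        lower = p - z * math.sqrt(p * (1 - p) / m)
        if lower * total >= k:
            return i / bins
    return 0.0


def top_k_join(r, s, k, threshold=0.0):
    """Brute-force top-k join keeping the best k pairs in a min-heap."""
    heap = []  # entries are (similarity, index in r, index in s)
    for i, a in enumerate(r):
        for j, b in enumerate(s):
            sim = jaccard(a, b)
            if sim < threshold:
                continue
            if len(heap) < k:
                heapq.heappush(heap, (sim, i, j))
            elif sim > heap[0][0]:
                heapq.heapreplace(heap, (sim, i, j))
    return sorted(heap, reverse=True)


def sampled_top_k_join(r, s, k, sample_size=50, seed=0):
    """Estimate a threshold from a sample, then run the top-k join."""
    t = estimate_threshold(r, s, k, sample_size, seed=seed)
    result = top_k_join(r, s, k, threshold=t)
    if len(result) < k:  # threshold was too high: retry without it
        result = top_k_join(r, s, k)
    return result
```

In this sketch the fallback keeps the result exact even when the estimated threshold is too high: if at least $\kappa$ pairs pass the threshold, the best $\kappa$ of them are also the best $\kappa$ overall, and otherwise the join is repeated without a threshold.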

Keywords

top-$\kappa$ similarity join; sampling; 95% confidence level; performance evaluation

