A Similarity Join Algorithm Using a Median as a Filter

Park, Jong Soo

doi:10.3745/KTSDE.2015.4.2.71

KIPS Transactions on Software and Data Engineering (정보처리학회논문지:소프트웨어 및 데이터공학)

Volume 4 Issue 2 / Pages 71-76 / 2015 / 2287-5905 (pISSN) / 2734-0503 (eISSN)

Korea Information Processing Society (한국정보처리학회)


Park, Jong Soo (School of IT, Sungshin Women's University)

Received : 2014.09.12
Accepted : 2014.11.24
Published : 2015.02.28

https://doi.org/10.3745/KTSDE.2015.4.2.71

Abstract

In similarity join processing, a common technique is the generation-verification framework, which has two phases: the first generates a set of candidate pairs from a collection of records, and the second verifies each candidate pair by computing its actual similarity. To reduce the number of candidate pairs that reach the verification phase, this paper uses the median of one record in each candidate pair as a filter that tests whether the other record can have the required number of overlapping tokens. We propose a similarity join algorithm with this median filter and show, through extensive experiments on real-world datasets, that it runs faster than recent algorithms that lack the filter.
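The abstract does not spell out the filter, so the following Python sketch is only one plausible reading of it, not the paper's algorithm. It assumes Jaccard similarity, records stored as non-empty sorted lists of distinct integer tokens, and a naive all-pairs generation phase. All function names are hypothetical. The idea illustrated: split both records at the median token of one record. Tokens below the median can only match tokens below it, and likewise above, which gives an upper bound on the overlap. A pair whose bound cannot reach the threshold is dropped before the exact similarity is computed.

```python
from bisect import bisect_left


def overlap_upper_bound(x, y):
    """Upper bound on |x ∩ y| obtained by splitting at the median token of x.

    Assumes x and y are non-empty sorted lists of distinct tokens.
    """
    mid = len(x) // 2
    median_token = x[mid]
    # Number of tokens in y that are strictly smaller than the median of x.
    p = bisect_left(y, median_token)
    left = min(mid, p)                      # x[:mid] can only match y[:p]
    right = min(len(x) - mid, len(y) - p)   # x[mid:] can only match y[p:]
    return left + right


def passes_median_filter(x, y, threshold):
    """True if the pair could still reach the Jaccard threshold."""
    ub = overlap_upper_bound(x, y)
    # Jaccard grows with the overlap, so test it at the upper bound.
    return ub >= threshold * (len(x) + len(y) - ub)


def jaccard(x, y):
    """Exact Jaccard similarity (the verification step)."""
    inter = len(set(x) & set(y))
    return inter / (len(x) + len(y) - inter)


def similarity_join(records, threshold):
    """Generation-verification join with the filter applied before verification."""
    result = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):   # naive candidate generation (assumption)
            x, y = records[i], records[j]
            if not passes_median_filter(x, y, threshold):
                continue                        # pruned without computing similarity
            if jaccard(x, y) >= threshold:
                result.append((i, j))
    return result


records = [
    [1, 2, 3, 4, 5],
    [1, 2, 3, 4, 6],
    [6, 7, 8, 9, 10],
    [1, 7, 8, 9, 10],
]
print(similarity_join(records, 0.6))  # [(0, 1), (2, 3)]
```

In this toy input the filter discards pairs (0, 2) and (1, 2) outright, while pairs (0, 3) and (1, 3) pass the filter and are rejected only by verification. The filter is a necessary condition, not a sufficient one.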


