[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.5626/JOK.2016.43.11.1275

An Efficient String Similarity Search Technique based on Generating Inverted Lists of Variable-Length Grams

Kim, Jongik (Chonbuk National Univ.)

Publication Information

Journal of KIISE / v.43, no.11, 2016, pp. 1275-1280

Abstract

Existing techniques for string similarity search first generate a set of candidate strings and then verify the candidates. The efficiency of string similarity search therefore depends heavily on the candidate generation method. State-of-the-art techniques select fixed-length q-grams from a query string and generate candidates using the inverted lists of the selected q-grams. In this paper, we propose a technique that generates candidates using variable-length grams of a query string, and we develop a dynamic programming algorithm that selects an optimal combination of variable-length grams from the query string. Experimental results show that the proposed technique improves the performance of string similarity search compared with existing techniques.
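The generate-then-verify pipeline summarized in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not state the cost model, the index layout, or the gram length bounds, so everything below is an assumption made for illustration. The sketch assumes an edit distance threshold `tau`, an index over all substrings with lengths between `qmin` and `qmax`, and a cost equal to the total size of the chosen inverted lists. It relies on the standard pigeonhole argument that `tau` edits can touch at most `tau` of `tau + 1` disjoint query grams, so any answer must contain at least one chosen gram. All function and parameter names are hypothetical.

```python
from collections import defaultdict


def build_index(strings, qmin, qmax):
    """Map every substring of length qmin..qmax to the ids of strings containing it."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for q in range(qmin, qmax + 1):
            for i in range(len(s) - q + 1):
                index[s[i:i + q]].add(sid)
    return index


def select_grams(query, index, tau, qmin, qmax):
    """Choose tau+1 disjoint query grams minimizing total inverted-list size.

    Assumed cost model (not taken from the paper): sum of list lengths.
    Returns None when the query is too short to hold tau+1 disjoint grams.
    """
    n, k = len(query), tau + 1
    inf = float("inf")
    # best[i][j]: minimal cost of choosing j disjoint grams inside query[:i]
    best = [[inf] * (k + 1) for _ in range(n + 1)]
    choice = [[None] * (k + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        best[i][0] = 0
    for i in range(1, n + 1):
        for j in range(1, k + 1):
            best[i][j] = best[i - 1][j]  # option 1: no gram ends at position i
            for q in range(qmin, min(qmax, i) + 1):  # option 2: a gram of length q ends at i
                cost = best[i - q][j - 1] + len(index.get(query[i - q:i], ()))
                if cost < best[i][j]:
                    best[i][j], choice[i][j] = cost, q
    if best[n][k] == inf:
        return None
    grams, i, j = [], n, k
    while j > 0:  # backtrack through the recorded choices
        q = choice[i][j]
        if q is None:
            i -= 1
        else:
            grams.append(query[i - q:i])
            i, j = i - q, j - 1
    return grams[::-1]


def edit_distance(a, b):
    """Classic row-by-row dynamic programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def search(query, strings, index, tau, qmin=2, qmax=3):
    """Generate candidates from the selected grams, then verify each one."""
    grams = select_grams(query, index, tau, qmin, qmax)
    if grams is None:
        candidates = range(len(strings))  # query too short: verify everything
    else:
        candidates = set().union(*(index.get(g, set()) for g in grams))
    return sorted(strings[sid] for sid in candidates
                  if edit_distance(query, strings[sid]) <= tau)


strings = ["similarity", "similarly", "singularity", "simulation", "string"]
index = build_index(strings, qmin=2, qmax=3)
print(search("similarity", strings, index, tau=2))
```

Because longer grams tend to have shorter inverted lists, allowing the selection to mix gram lengths can shrink the candidate set relative to a fixed q, which is the intuition behind using variable-length grams. The verification step keeps the result exact regardless of which grams are chosen.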

Keywords

string similarity search; edit distance; variable-length gram; q-gram inverted lists

Reference

1	S. Sarawagi and A. Kirpal, "Efficient set joins on similarity predicates," Proc. of the ACM SIGMOD Conference, pp. 743-754, 2004.
2	C. Li, J. Lu, and Y. Lu, "Efficient merging and filtering algorithms for approximate string searches," IEEE ICDE, pp. 257-266, 2008.
3	C. Xiao, W. Wang, and X. Lin, "Ed-join: an efficient algorithm for similarity joins with edit distance constraints," PVLDB, Vol. 1, pp. 933-944, 2008.
4	J. Kim and H. Lee, "Efficient exact similarity searches using multiple token orderings," IEEE ICDE, 2012.
5	S. Chaudhuri, V. Ganti, and R. Kaushik, "A primitive operator for similarity joins in data cleaning," IEEE ICDE, p. 5, 2006.
6	J. Kim, "An Effective Candidate Generation Method for Improving Performance of Edit Similarity Query Processing," Information Systems, Vol. 47, No. 1, pp. 116-128, 2015.
7	G. Li, D. Deng, J. Wang, and J. Feng, "Pass-join: A partition based method for similarity joins," PVLDB, Vol. 5, pp. 253-264, 2011.
8	J. Qin, W. Wang, Y. Lu, C. Xiao, and X. Lin, "Efficient exact edit similarity query processing with asymmetric signature schemes," Proc. of the ACM SIGMOD Conference, pp. 1033-1044, 2011.
9	C. Li, B. Wang, and X. Yang, "VGRAM: Improving performance of approximate queries on string collections using variable-length grams," VLDB, pp. 303-314, 2007.
10	J. Kim, C. Li, and X. Xie, "Improving read mapping using additional prefix grams," BMC Bioinformatics, Vol. 15, No. 1, p. 42, 2014.
