http://dx.doi.org/10.5626/JOK.2016.43.11.1275

An Efficient String Similarity Search Technique based on Generating Inverted Lists of Variable-Length Grams  

Kim, Jongik (Chonbuk National Univ.)
Publication Information
Journal of KIISE / v.43, no.11, 2016, pp. 1275-1280
Abstract
Existing techniques for string similarity search first generate a set of candidate strings and then verify each candidate. The efficiency of the search depends heavily on the candidate generation method. State-of-the-art techniques select fixed-length q-grams from a query string and generate candidates from the inverted lists of the selected q-grams. In this paper, we propose a technique that generates candidates using variable-length grams of a query string, and we develop a dynamic programming algorithm that selects an optimal combination of variable-length grams from the query string. Experimental results show that the proposed technique improves the performance of string similarity search compared with existing techniques.
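To make the generate-then-verify idea concrete, here is a minimal Python sketch. It is not the paper's algorithm. It assumes a pigeonhole-style filter (any string within edit distance tau of the query must contain at least one of tau + 1 non-overlapping query grams) and assumes the cost of a gram combination is the total length of its inverted lists. The names build_index, select_grams, search, qmin, and qmax are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_index(strings, qmin, qmax):
    """Map every gram of length qmin..qmax to the ids of strings containing it."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for q in range(qmin, qmax + 1):
            for i in range(len(s) - q + 1):
                index[s[i:i + q]].add(sid)
    return index


def select_grams(query, index, k, qmin, qmax):
    """Pick k non-overlapping grams of the query with minimal total list length.

    best[i][j] is the minimal cost of choosing j grams inside query[:i].
    Returns None when the query is too short to hold k grams.
    """
    n, inf = len(query), float("inf")
    best = [[0] + [inf] * k for _ in range(n + 1)]
    choice = [[None] * (k + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, k + 1):
            best[i][j] = best[i - 1][j]  # option 1: no gram ends at position i
            for q in range(qmin, min(qmax, i) + 1):  # option 2: a gram of length q ends at i
                cost = best[i - q][j - 1] + len(index.get(query[i - q:i], ()))
                if cost < best[i][j]:
                    best[i][j], choice[i][j] = cost, q
    if best[n][k] == inf:
        return None
    grams, i, j = [], n, k
    while j > 0:  # walk the choices backwards to recover the selected grams
        q = choice[i][j]
        if q is None:
            i -= 1
        else:
            grams.append(query[i - q:i])
            i, j = i - q, j - 1
    return grams[::-1]


def edit_distance(a, b):
    """Standard dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def search(query, strings, index, tau, qmin=2, qmax=3):
    """Generate candidates from the selected grams' lists, then verify each one."""
    grams = select_grams(query, index, tau + 1, qmin, qmax)
    if grams is None:
        candidates = range(len(strings))  # query too short: fall back to a full scan
    else:
        candidates = set().union(*(index.get(g, set()) for g in grams))
    return sorted(strings[c] for c in candidates
                  if edit_distance(query, strings[c]) <= tau)


strings = ["similarity", "similarly", "singularity", "simulation", "dissimilar"]
index = build_index(strings, qmin=2, qmax=3)
print(search("similarity", strings, index, tau=2))
```

Because the selected grams do not overlap, a single edit operation can damage at most one of them, so under the stated assumption the candidate set cannot miss a true answer. Allowing several gram lengths gives the selection step more combinations to choose from than a single fixed q would.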
Keywords
string similarity search; edit distance; variable-length gram; q-gram inverted lists