A Space-Efficient Inverted Index Technique using Data Rearrangement for String Similarity Searches

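  • 임마누 (Division of Computer Science and Engineering, Chonbuk National University) ;
  • 김종익 (Division of Computer Science and Engineering, Chonbuk National University)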
  • Received : 2015.07.23
  • Accepted : 2015.09.08
  • Published : 2015.10.15

Abstract

Inverted index structures are widely used for efficient string similarity search. Because similarity search requires a fast response time, most techniques keep the index in main memory. An inverted index is usually very large, however, so it is not practical to assume that it will fit into main memory; when the data set is very large or resources are constrained, query processing with an inverted index may not be possible at all. To alleviate this problem, we propose a novel technique that reduces the size of an inverted index. The technique rearranges the data strings so that strings containing the same q-grams are placed close to one another, and then encodes those adjacent strings as a range. Through an experimental study using real data sets, we show that the technique significantly reduces the size of an inverted index without sacrificing query processing time.
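The idea in the abstract can be illustrated with a minimal sketch. It builds a q-gram inverted index, collapses each posting list into ranges of consecutive string IDs, and compares the number of ranges before and after reordering the strings. The function names (`qgrams`, `build_index`, `to_ranges`, `index_size`) and the sample data are hypothetical, and plain lexicographic sorting is used only as a stand-in for the rearrangement step; the abstract does not specify the paper's actual rearrangement algorithm.

```python
from collections import defaultdict


def qgrams(s, q=2):
    """Return the set of overlapping q-grams of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def build_index(strings, q=2):
    """Map each q-gram to the ascending list of string IDs containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in qgrams(s, q):
            index[g].append(sid)  # IDs arrive in increasing order
    return dict(index)


def to_ranges(ids):
    """Collapse an ascending ID list into inclusive (start, end) ranges."""
    ranges = []
    for sid in ids:
        if ranges and sid == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], sid)  # extend the current run
        else:
            ranges.append((sid, sid))  # start a new run
    return ranges


def index_size(strings, q=2):
    """Total number of ranges stored across all posting lists."""
    index = build_index(strings, q)
    return sum(len(to_ranges(ids)) for ids in index.values())


# Hypothetical sample data; strings sharing q-grams are interleaved.
data = ["night", "apple", "light", "apply", "sight", "ample"]

# Assumption: lexicographic order stands in for the rearrangement step.
rearranged = sorted(data)

print(index_size(data))        # 24 ranges in the original order
print(index_size(rearranged))  # 13 ranges after rearrangement
```

In the original order no two strings sharing a q-gram are adjacent, so every posting is its own range. After reordering, strings that share q-grams receive consecutive IDs and each posting list collapses into a single range.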

Acknowledgement

Supported by: National Research Foundation of Korea
