http://dx.doi.org/10.4218/etrij.16.0115.0594

Fast, Flexible Text Search Using Genomic Short-Read Mapping Model  

Kim, Sung-Hwan (Department of Electrical and Computer Engineering, Pusan National University)
Cho, Hwan-Gue (Department of Electrical and Computer Engineering, Pusan National University)
Publication Information
ETRI Journal / v.38, no.3, 2016, pp. 518-528
Abstract
Searching an extensive document database for documents that are locally similar to a given query document, and subsequently detecting the similar regions between such documents, is considered an essential task in information retrieval and data management. In this paper, we present a framework for this task. The proposed framework employs short-read mapping, a method used in bioinformatics to reveal similarities between genomic sequences. Documents are treated as biological objects; consequently, edit operations between locally similar documents are viewed as an evolutionary process. Accordingly, we can apply evolution tracing to detect similar regions between documents. In addition, we propose heuristic methods to address issues that arise at different stages of the framework, for example, a frequency-based fragment ordering method and a locality-aware interval aggregation method. Extensive experiments covering various scenarios of searching a document database for locally similar documents indicate that the proposed framework outperforms existing methods.
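The general idea summarized in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a plain dictionary index in place of a compressed genome-style index, exact matching of fixed-length fragments, and a simple gap threshold for merging hits. The function names (`build_index`, `search`) and the parameters `k`, `max_gap`, and `max_fragments` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_index(docs, k):
    """Map every k-character fragment to the (doc_id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos in range(len(text) - k + 1):
            index[text[pos:pos + k]].append((doc_id, pos))
    return index


def search(query, index, k, max_gap, max_fragments=None):
    """Return {doc_id: [(start, end), ...]} regions that share fragments with the query."""
    # Cut the query into non-overlapping fragments, playing the role of short reads.
    fragments = {query[i:i + k] for i in range(0, len(query) - k + 1, k)}

    # Assumed stand-in for frequency-based ordering: look up rare fragments first,
    # since they are the most selective, and optionally stop after a budget.
    present = sorted((f for f in fragments if f in index),
                     key=lambda f: (len(index[f]), f))
    if max_fragments is not None:
        present = present[:max_fragments]

    # Collect the matched interval of every fragment occurrence, per document.
    hits = defaultdict(list)
    for fragment in present:
        for doc_id, pos in index[fragment]:
            hits[doc_id].append((pos, pos + k))

    # Assumed stand-in for locality-aware aggregation: merge intervals that lie
    # within max_gap characters of each other into one candidate region.
    regions = {}
    for doc_id, spans in hits.items():
        spans.sort()
        merged = [list(spans[0])]
        for start, end in spans[1:]:
            if start - merged[-1][1] <= max_gap:
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        regions[doc_id] = [tuple(span) for span in merged]
    return regions
```

In this sketch, an edited passage is still found as long as some of its fragments survive unchanged; the gap threshold then bridges the altered stretches between surviving fragments. A larger `max_gap` tolerates longer edits at the cost of merging unrelated hits.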
Keywords
Text similarity search; document search; short-read mapping; approximate string matching; plagiarism detection