[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.1633/JISTaP.2015.3.1.2

Query Formulation for Heuristic Retrieval in Obfuscated and Translated Partially Derived Text

Kumar, Aarti (Department of Computer Applications, Maulana Azad National Institute of Technology)
Das, Sujoy (Department of Computer Applications, Maulana Azad National Institute of Technology)

Publication Information

Journal of Information Science Theory and Practice / v.3, no.1, 2015, pp. 24-39

Abstract

Pre-retrieval query formulation is an important step in identifying local text reuse. When the reuse is local and involves heavy obfuscation, paraphrasing, or translation, finding the reused text in a document is challenging. This paper studies three pre-retrieval query formulation strategies for heuristic retrieval of low-obfuscation, high-obfuscation, and translated text: (a) query formulation using proper nouns; (b) query formulation using unique words (Hapax); and (c) query formulation using the most frequent words. For low and high obfuscation and simulated paraphrasing, keywords with Hapax proved slightly more efficient. Even so, initial results indicate that the simple strategy of query formulation using proper nouns gives promising results and may be better at reducing the size of the corpus for post-processing when identifying local text reuse in obfuscated and translated text.
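The three strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the stopword list, the capitalization-based proper-noun heuristic, and the function names `proper_noun_query`, `hapax_query`, and `frequent_query` are all assumptions made for illustration, and a real system would more likely use a part-of-speech tagger to find proper nouns.

```python
import re
from collections import Counter

# Hypothetical stopword list for illustration only; not taken from the paper.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to",
             "is", "was", "for", "by", "with"}


def tokenize(text):
    """Split text into alphabetic word tokens, preserving case."""
    return re.findall(r"[A-Za-z]+", text)


def proper_noun_query(text):
    """Strategy (a), approximated: capitalized tokens that are not sentence-initial."""
    terms = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        # Skip the first token, which is capitalized regardless of its type.
        for token in tokenize(sentence)[1:]:
            if (token[0].isupper()
                    and token.lower() not in STOPWORDS
                    and token not in terms):
                terms.append(token)
    return terms


def hapax_query(text):
    """Strategy (b): words that occur exactly once in the document."""
    counts = Counter(t.lower() for t in tokenize(text))
    return sorted(w for w, c in counts.items()
                  if c == 1 and w not in STOPWORDS)


def frequent_query(text, k=3):
    """Strategy (c): the k most frequent non-stopword terms."""
    counts = Counter(t.lower() for t in tokenize(text)
                     if t.lower() not in STOPWORDS)
    return [w for w, _ in counts.most_common(k)]


sample = ("The minister visited Bhopal on Monday. "
          "Officials in Bhopal said the minister praised the city.")
print(proper_noun_query(sample))
print(hapax_query(sample))
print(frequent_query(sample, k=2))
```

Each function returns a list of terms that could be joined into a query string for a retrieval engine. The capitalization heuristic misses sentence-initial proper nouns and would not carry over to scripts without letter case, which is one reason it should be read only as a stand-in.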

Keywords

Heuristic; obfuscated; translated; simulated paraphrasing; retrieval; Hapax; query formulation; pre-retrieval
