[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.4218/etrij.2019-0458

Building a text collection for Urdu information retrieval

Rasheed, Imran (Department of Computer Science and Engineering, Indian Institute of Technology (ISM))
Banka, Haider (Department of Computer Science and Engineering, Indian Institute of Technology (ISM))
Khan, Hamaid M. (Aluteam, Fatih Sultan Mehmet Vakif University)

Publication Information

ETRI Journal / v.43, no.5, 2021, pp. 856-868

Abstract

Urdu is a widely spoken language in the Indian subcontinent with over 300 million speakers worldwide. However, linguistic advancements in Urdu are rare compared to those in European and other Asian languages. Therefore, following Text Retrieval Conference (TREC) standards, we attempted to construct an extensive text collection of 85,304 documents from diverse categories, covering over 52 topics with relevance judgment sets at a pool depth of 100. We also present several applications to demonstrate the effectiveness of our collection. Although this collection is primarily intended for text retrieval, it can also be used for named entity recognition, text summarization, and other linguistic applications with suitable modifications. Ours is the most extensive existing collection for the Urdu language, and it will be freely available for future research and academic education.

Keywords

assessors agreement; relevance judgment; text collection construction and evaluation; Urdu corpus; Urdu information retrieval;

Reference

1	D. Becker and K. Riaz, A study in Urdu corpus construction, in Proc. Workshop Asian Lang. Resour. Int. Stand. vol. 12, (Stroudsburg, PA, USA), Aug. 2002, pp. 1-5.
2	S. Urooj et al., Cle Urdu digest corpus, in Proc. Conf. Lang. Technol. (SNLP), (Lahore, Pakistan), 2012, pp. 47-53.
3	F. Baseer, A. Habib, and J. Ashraf, Romanized Urdu corpus development (rucd) model: Edit-distance based most frequent unique unigram extraction approach using real-time interactive dataset, in Proc. Int. Conf. Innov. Comput. Technol. (INTECH), (Dublin, Ireland), Aug. 2016, pp. 513-518.
4	Q. Abbas, Building a hierarchical annotated corpus of Urdu: The Urdu.kon-tb treebank, in International Conference on Intelligent Text Processing and Computational Linguistics, Springer, Berlin, Germany, 2012, pp. 66-79.
5	M. Ijaz and S. Hussain, Corpus based Urdu lexicon development, in Proc. Conf. Lang. Technol. (CLT07), vol. 73, (Peshawar, Pakistan), Aug. 2007.
6	M. Karthikeyan and P. Aruna, Probability based document clustering and image clustering using content-based image retrieval, Appl. Soft Comp. 13 (2013), no. 2, 959-966.
7	M. Humayoun et al., Urdu summary corpus, in Proc. Int. Conf. Lang. Resour. Eval. (Reykjavik, Iceland), May 2014, pp. 796-800, https://github.com/humsha/USCorpus
8	Q. A. Akram, A. Naseer, and S. Hussain, Assasband, an affix-exception-list based Urdu stemmer, in Proc. Workshop Asian Lang. Resour. (Suntec, Singapore), Aug. 2009, pp. 40-47.
9	I. Rasheed et al., Urdu text classification: A comparative study using machine learning techniques, in Proc. Int. Conf. Digit. Inf. Manag. (ICDIM) (Berlin, Germany), Sept. 2018, pp. 274-278.
10	A. AleAhmad et al., Hamshahri: A standard persian text collection, Knowl. Based Syst. 22 (2009), no. 5, 382-387.
11	I. Rasheed and H. Banka, Query expansion in information retrieval for Urdu language, in Proc. Int. Conf. Inf. Retr. Knowl. Manag. (CAMP), (Kota Kinabalu, Malaysia), Mar. 2018, pp. 171-176.
12	I. Rasheed, H. Banka, and H. M. Khan, Pseudo-relevance feedback based query expansion using boosting algorithm, Artif. Intell. Rev. (2021), https://doi.org/10.1007/s10462-021-09972-4
13	S. Hussain, Resources for Urdu language processing, in Proc. Workshop Asian Lang. Resour. IJCNLP, (Hyderabad, India), Jan. 2008, pp. 99-100, https://www.aclweb.org/anthology/I08-7017.pdf
14	A. Kanapala and S. Pal, Test collection for legal ir from online discussion forums, in Proc. Forum Inf. Retr. Eval. (Bangalore, India), Dec. 2014, pp. 126-129.
15	I. Ounis et al., Terrier information retrieval platform, in Advances in Information Retrieval, vol. 3408, Springer, Berlin, Germany, 2005, pp. 517-519.
16	E. M. Voorhees, Overview of trec 2003, in Proc. Text Retr. Conf. (TREC), (Gaithersburg, MD, USA), Nov. 2003, pp. 1-13, https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=150467
17	S. E. Robertson et al., Okapi at trec-4, in Proc. Text REtrieval Conf. (London, UK), Oct. 1996, pp. 73-96, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.33.3342
18	C. D. Manning and H. Schutze, Foundations of Statistical Natural Language Processing, vol. 999, MIT Press, Cambridge, MA, USA, 1999, https://nlp.stanford.edu/fsnlp/.
19	P. Clough and M. Sanderson, Evaluating the performance of information retrieval systems using test collections, Inf. Res. 18 (2013), no. 2.
20	A. K. McCallum, Mallet: A machine learning for language toolkit, 2002, http://mallet.cs.umass.edu/.
21	R. Rahimi, A. Shakery, and I. King, Extracting translations from comparable corpora for cross-language information retrieval using the language modeling framework, Inf. Process. Manage. 52 (2016), no. 2, 299-318.
22	M. Humayoun, H. Hammarstrom, and A. Ranta, Urdu morphology, orthography and lexicon extraction, M.S. thesis, Department of Computer Science and Engineering, Chalmers tekniska hogskola, Goteborg, Sweden, 2006.
23	A. Hardie, Developing a tag-set for automated part-of-speech tagging in Urdu, in Proc. Corpus Linguistics (Lancaster, UK), Mar. 2003.
24	P. Baker et al., Corpus data for south asian language processing, in Proc. Workshop South Asian Lang. Process. (EACL), (Budapest, Hungary), Apr. 2003, pp. 1-8.
25	K. Riaz, Baseline for Urdu IR evaluation, in Proc. ACM workshop Improving non english web searching (Napa Valley, CA, USA), Oct. 2008, pp. 97-100.
26	A. Daud, W. Khan, and D. Che, Urdu language processing: A survey, Artif. Intell. Rev. 47 (2017), 279-311.
27	M. Sharjeel, R. M. A. Nawab, and P. Rayson, Counter: Corpus of urdu news text reuse, Lang. Res. Eval. 51 (2017), 777-803.
28	V. Gupta, N. Joshi, and I. Mathur, Design & development of rule based inflectional and derivational Urdu stemmer, in Proc. Int. Conf. Futuristic Trends Comput. Anal. Knowl. Manag. (ABLAZE), (Greater Noida, India), Feb. 2015, pp. 7-12.
29	K. Riaz, Concept search in Urdu, in Proc. PhD workshop Inf. Knowl. Manag. (Napa Valley, CA, USA), Oct. 2008, pp. 33-40.
30	S. A. Ali et al., Salience analysis of news corpus using heuristic approach in Urdu language, Int. J. Comput. Sci. Netw. Secur. (IJCSNS), 16 (2016), no. 4, 28-36.
31	I. Hanif et al., Cross-language Urdu-English (clue) text alignment corpus, in Proc. Working notes CLEF (Toulouse, France), Sept. 2015.
32	Z. Ahmad et al., Urdu nastaleeq optical character recognition, World Acad. Sci., Eng. Technol. 26 (2007), pp. 249-252.
33	G. Salton, A. Wong, and C. S. Yang, A vector space model for automatic indexing, Commun. ACM 18 (1975), no. 11, 613-620.
34	G. Amati and C. J. Van Rijsbergen, Probabilistic models of information retrieval based on measuring the divergence from randomness, ACM Trans. Inf. Syst. (TOIS), 20 (2002), no. 4, 357-389.
35	K. Batri, S. Lakshmi, and B. Sathiyabhama, Trade-off between the number of index-terms and the information retrieval system's performance, Kuwait J. Sci. 44 (2017), no. 4, 49-56.
36	N. Craswell et al., Overview of the trec-2003 web track, in Proc. Text Retr. Conf. (TREC), vol. 3, (Gaithersburg, MD, USA), 2002.
37	J. M. Ponte and W. B. Croft, A language modeling approach to information retrieval, in Proc. Int. ACM SIGIR Conf. Res. Dev. Inf. Retr. (Melbourne, Australia), Aug. 1998, pp. 275-281.
38	E. Frank et al., Weka-a machine learning workbench for data mining, in Data Mining and Knowledge Discovery Handbook, Springer, Boston, MA, USA, 2009, pp. 1269-1277.
39	I. Haneef et al., Design and development of a large cross-lingual plagiarism corpus for urdu-english language pair, Sci. Program. 2019 (2019), 1-11.
40	L. Cohen, L. Manion, and K. Morrison, The ethics of educational and social research, in Research Methods in Education, 8th ed., Routledge, London, UK, 2013, https://doi.org/10.4324/9780203720967
41	W. B. Croft, D. Metzler, and T. Strohmann, Search Engines: Information Retrieval in Practice, Pearson Education, Boston, MA, USA, 2010.
42	T. Zia, M. P. Akhter, and Q. Abbas, Comparative study of feature selection approaches for Urdu text categorization, Malaysian J. Comput. Sci. 28 (2015), no. 2, 93-109.
43	N. Khan, M. P. Bakht, and R. A. Wagan, Corpus construction and structure study of Urdu language using empirical laws, in Proc. Int. Conf. Data Sci. (Karachi, Pakistan), Feb. 2019, pp. 9-14.