http://dx.doi.org/10.1633/JISTaP.2021.9.2.1

Word Embeddings-Based Pseudo Relevance Feedback Using Deep Averaging Networks for Arabic Document Retrieval

Farhan, Yasir Hadi (Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia)
Noah, Shahrul Azman Mohd (Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia)
Mohd, Masnizah (Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia)
Atwan, Jaffar (Prince Abdullah Bin Ghazi, Faculty of Information Technology, Al Balqa Applied University)

Publication Information

Journal of Information Science Theory and Practice / v.9, no.2, 2021, pp. 1-17

Abstract

Pseudo relevance feedback (PRF) is a powerful query expansion (QE) technique that reformulates queries by selecting expansion terms from the top k pseudo-relevant documents. Traditional PRF frameworks have robustly handled the vocabulary mismatch between user queries and pertinent documents; nevertheless, their expansion terms are chosen without regard to similarity to the terms of the original query. Word embedding (WE) schemes are techniques of significant interest for QE within the information retrieval domain. Deep averaging networks (DANs) define a framework that relies on average word presence passed through multiple linear layers. The complete query can therefore be represented by the average vector of the query terms, and this vector may be used to determine expansion terms pertinent to the entire query. In this study, we propose a DAN-based technique that augments PRF frameworks by integrating WE similarities to facilitate Arabic information retrieval. The technique is based on the principle that the top pseudo-relevant document set is assessed to determine the candidate term distribution, and expansion terms are then selected according to their similarity to the average vector representing the initial query terms. The Word2Vec model is used for the experiments, which are run on the standard Arabic TREC 2001/2002 collection. The majority of the evaluations indicate that the PRF implementation in the present study offers a significant performance improvement over the baseline PRF frameworks.
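
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the two-dimensional toy embeddings, and the English example terms are all hypothetical, and it covers only the averaging and similarity ranking. The DAN layers, the Word2Vec training, and the underlying retrieval model are left out.

```python
import math


def average_vector(terms, embeddings):
    """Average the embeddings of the terms that have a vector."""
    vectors = [embeddings[t] for t in terms if t in embeddings]
    if not vectors:
        raise ValueError("no query term has an embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_expansion_terms(query_terms, feedback_docs, embeddings, n_terms=2):
    """Rank candidate terms from the pseudo-relevant documents by their
    similarity to the average query vector and return the top n_terms."""
    query_vec = average_vector(query_terms, embeddings)
    # Candidates: distinct terms of the feedback documents that have an
    # embedding and are not already in the query.
    candidates = {
        term
        for doc in feedback_docs
        for term in doc
        if term in embeddings and term not in query_terms
    }
    scored = [(cosine(query_vec, embeddings[t]), t) for t in candidates]
    # Highest similarity first; ties are broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:n_terms]]


# Hypothetical toy data (assumption): real vectors would come from a
# Word2Vec model trained on an Arabic corpus.
toy_embeddings = {
    "oil": [1.0, 0.0],
    "price": [0.8, 0.6],
    "petroleum": [0.9, 0.1],
    "market": [0.7, 0.7],
    "football": [0.0, 1.0],
}
toy_feedback_docs = [
    ["oil", "petroleum", "market"],
    ["price", "football", "market"],
]
expansion = select_expansion_terms(
    ["oil", "price"], toy_feedback_docs, toy_embeddings
)
print(expansion)
```

With these toy vectors, "petroleum" and "market" lie closer to the averaged query vector than "football" does, so they are the two terms selected. Scoring each candidate against the whole query rather than against individual query terms is the point of using the average vector.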

Keywords

automatic query expansion; information retrieval; word embedding; deep averaging networks; pseudo relevance feedback; Arabic document retrieval on TREC collection
