Incorporating Deep Median Networks for Arabic Document Retrieval Using Word Embeddings-Based Query Expansion

  • Yasir Hadi Farhan (Department of Medical Physics, College of Applied Sciences, University of Fallujah)
  • Mohanaad Shakir (Department of Management Information System (MIS), College of Business (COB), University of Buraimi (UOB))
  • Mustafa Abd Tareq (Department of Computer Science, University of Technology-Iraq)
  • Boumedyen Shannaq (Department of Management Information System (MIS), College of Business (COB), University of Buraimi (UOB))
  • Received: 2023.10.10
  • Accepted: 2024.05.09
  • Published: 2024.09.30

Abstract

Information retrieval (IR) often suffers from query-document vocabulary mismatch: the terms in a user query do not match the terms used in relevant documents, which reduces search effectiveness. Automatic query expansion (AQE) techniques mitigate this issue by augmenting user queries with related terms or synonyms. Word embeddings, particularly Word2Vec, have gained prominence for AQE because they represent words as real-valued vectors. However, AQE methods typically expand each query term individually, which can lead to query drift if the expansion terms are not carefully selected. To address this, this study proposes using median vectors derived from deep median networks to capture similarity to the query as a whole. Integrating median vectors into candidate term generation and combining them with the BM25 probabilistic model and two IR strategies (EQE1 and V2Q) yields promising results, outperforming baseline methods in the experiments.
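
The idea of comparing candidate terms against the query as a whole, rather than against each query term separately, can be illustrated with a minimal sketch. It is not the authors' deep median network: it assumes that a plain element-wise median of the query term vectors stands in for the query representation, and that candidates are ranked by cosine similarity to that vector. The toy embedding table, the function names (`median_vector`, `expansion_candidates`), and the cutoff `k` are all hypothetical; in practice the vectors would come from a trained Word2Vec model.

```python
import numpy as np

# Hypothetical toy embeddings; real vectors would come from a trained Word2Vec model.
TOY_EMBEDDINGS = {
    "bank": np.array([1.0, 0.0, 0.0]),
    "loan": np.array([0.9, 0.1, 0.0]),
    "money": np.array([0.85, 0.15, 0.0]),
    "credit": np.array([0.8, 0.2, 0.0]),
    "river": np.array([0.0, 1.0, 0.0]),
    "football": np.array([0.0, 0.0, 1.0]),
}


def median_vector(vectors):
    """Return the element-wise median of a list of equal-length word vectors."""
    return np.median(np.asarray(vectors, dtype=float), axis=0)


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def expansion_candidates(query_terms, embeddings, k=2):
    """Rank vocabulary terms by similarity to the query's median vector.

    Assumption: the whole query is summarized by one median vector, so each
    candidate is scored against the query as a whole, not per query term.
    """
    known = [t for t in query_terms if t in embeddings]
    if not known:
        return []  # no query term has an embedding, so nothing to expand with
    query_vec = median_vector([embeddings[t] for t in known])
    scored = [
        (term, cosine(query_vec, vec))
        for term, vec in embeddings.items()
        if term not in known  # do not propose the query's own terms
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in scored[:k]]


if __name__ == "__main__":
    print(expansion_candidates(["bank", "loan", "money"], TOY_EMBEDDINGS, k=2))
```

In a full pipeline the selected terms would be appended to the original query before it is scored against documents with a ranking model such as BM25. How the candidates are weighted and combined with EQE1 and V2Q is specific to the paper and is not reproduced here.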

Acknowledgments

We thank the LDC for providing the LDC2001T55 Arabic Newswire Part 1 corpus free of charge and for awarding us the LDC Data Scholarship in the autumn of 2012. This study is partially supported by the Universiti Kebangsaan Malaysia grant DCP-2017-007/4.
