Building Hybrid Stop-Words Technique with Normalization for Pre-Processing Arabic Text

  • Received : 2022.07.05
  • Published : 2022.07.30

Abstract

In natural language processing, commonly used words such as prepositions are referred to as stop-words; they have no inherent meaning and are therefore ignored in indexing and retrieval tasks. Removing stop-words from Arabic text significantly reduces the size of a text corpus, which improves the effectiveness and performance of Arabic-language processing systems. This study investigated the effectiveness of applying stop-word list elimination with normalization as a preprocessing step. The idea was to merge a statistical method with a linguistic method to attain the best efficacy, and to compare the effects of this two-pronged approach in reducing corpus size for Arabic natural language processing systems. Three stop-word lists were considered: an Arabic Text Lookup Stop-list, a Frequency-based Stop-list using Zipf's law, and a Combined Stop-list. An experiment was conducted using a selected file from the Arabic Newswire data set. In the experiment, the size of the corpus was compared after removing the words contained in each list. The results showed that the best reduction in size was achieved by using the Combined Stop-list with normalization, with a word count reduction of 452,930 and a compression rate of 30%.
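The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the normalization rules (stripping diacritics and tatweel, unifying alef variants, mapping alef maqsura to ya and ta marbuta to ha), the function names, the `top_k` cutoff for the frequency-based list, and the definition of the reduction rate as removed tokens divided by total tokens are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed normalization rules, common in Arabic text processing;
# the abstract does not state which rules the study applied.
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")   # harakat and tatweel
_ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")  # alef with madda/hamza


def normalize(token):
    """Return a normalized form of one Arabic token."""
    token = _DIACRITICS.sub("", token)
    token = _ALEF_VARIANTS.sub("\u0627", token)  # -> bare alef
    token = token.replace("\u0649", "\u064A")    # alef maqsura -> ya
    token = token.replace("\u0629", "\u0647")    # ta marbuta -> ha
    return token


def frequency_stoplist(tokens, top_k):
    """Statistical list: the top_k most frequent tokens (hypothetical cutoff).

    Zipf's law says a few top-ranked words cover a large share of a corpus,
    which motivates treating them as stop-word candidates.
    """
    return {word for word, _ in Counter(tokens).most_common(top_k)}


def reduction_rate(count_before, count_after):
    """Fraction of tokens removed (assumed definition of the rate)."""
    if count_before == 0:
        return 0.0
    return (count_before - count_after) / count_before


def preprocess(tokens, lookup_stoplist, top_k):
    """Normalize, then remove the union of the lookup and frequency lists."""
    normalized = [normalize(t) for t in tokens]
    combined = {normalize(w) for w in lookup_stoplist}
    combined |= frequency_stoplist(normalized, top_k)
    kept = [t for t in normalized if t not in combined]
    return kept, reduction_rate(len(normalized), len(kept))
```

Normalizing before building the frequency list means spelling variants of the same word are counted together, and normalizing the lookup list keeps the two lists comparable. On a real corpus, `top_k` would need tuning, since an overly large cutoff starts removing content words.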

Acknowledgement

We would like to thank the Linguistic Data Consortium (LDC) for providing the LDC2001T55 Arabic Newswire Part 1 data set at no cost, and for awarding us the Fall 2012 LDC Data Scholarship.
