http://dx.doi.org/10.22937/IJCSNS.2022.22.7.9

Building Hybrid Stop-Words Technique with Normalization for Pre-Processing Arabic Text  

Atwan, Jaffar (Al-Balqa Applied University)
Publication Information
International Journal of Computer Science & Network Security / v.22, no.7, 2022, pp. 65-74
Abstract
In natural language processing, commonly used words such as prepositions are referred to as stop-words; they have no inherent meaning and are therefore ignored in indexing and retrieval tasks. Removing stop-words from Arabic text significantly reduces the size of a text corpus, which in turn improves the effectiveness and performance of Arabic-language processing systems. This study investigated the effectiveness of applying stop-word list elimination together with normalization as a preprocessing step. The idea was to merge a statistical method with a linguistic method to attain the best efficacy, and to compare the effects of this two-pronged approach on reducing corpus size for Arabic natural language processing systems. Three stop-word lists were considered: an Arabic Text Lookup Stop-list, a Frequency-based Stop-list using Zipf's law, and a Combined Stop-list. An experiment was conducted using a selected file from the Arabic Newswire data set. In the experiment, the size of the corpus was compared after removing the words contained in each list. The results showed that the best reduction in size was achieved by using the Combined Stop-list with normalization, with a word count reduction of 452,930 and a compression rate of 30%.
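The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the abstract does not state the normalization rules, the frequency cutoff, or the exact definition of compression rate, so the choices below are assumptions. Here normalization strips diacritics and tatweel and unifies alef variants, the frequency-based list simply takes the `top_k` most frequent tokens (as Zipf's law suggests these dominate the corpus), and compression rate is taken to be the fraction of tokens removed. All function names are hypothetical.

```python
import re
from collections import Counter

# Assumed normalization rules; the abstract does not specify them.
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")    # tashkeel marks and tatweel
_ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")  # alef with madda / hamza above / hamza below


def normalize(token):
    """Strip diacritics and tatweel, then map alef variants to bare alef."""
    token = _DIACRITICS.sub("", token)
    return _ALEF_VARIANTS.sub("\u0627", token)


def frequency_stoplist(tokens, top_k):
    """Statistical list: the top_k most frequent tokens (cutoff is an assumption)."""
    return {word for word, _ in Counter(tokens).most_common(top_k)}


def combined_stoplist(lookup_list, tokens, top_k):
    """Hybrid list: union of a linguistic lookup list and the frequency-based list."""
    return set(lookup_list) | frequency_stoplist(tokens, top_k)


def remove_stopwords(tokens, stoplist):
    """Keep only tokens that are not in the stop-list."""
    return [t for t in tokens if t not in stoplist]


def compression_rate(original, reduced):
    """Fraction of tokens removed (assumed definition of compression rate)."""
    if not original:
        return 0.0
    return (len(original) - len(reduced)) / len(original)
```

A typical use would normalize every token first, build the combined list from a lookup list plus the corpus frequencies, and then compare `compression_rate` across the three lists. Normalizing before lookup matters because spelling variants of the same stop-word (for example, different alef forms) would otherwise miss the list.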
Keywords
Arabic; Normalization; Preprocessing; StopWords; Zipf's law
2012 ACM Computing Classification System: Computing methodologies; Artificial intelligence; Natural language processing