A Hybrid Approach for the Morpho-Lexical Disambiguation of Arabic

  • Received : 2015.04.03
  • Accepted : 2015.10.15
  • Published : 2016.09.30

Abstract

To substantially reduce the ambiguity rate, we propose a disambiguation approach based on selecting the right diacritics at different levels of analysis. This hybrid approach combines a linguistic approach with a multi-criteria decision method and can be considered an alternative solution to the morpho-lexical ambiguity problem, regardless of the proportion of diacritics in the processed text. To evaluate it, we applied the disambiguation to the online Alkhalil morphological analyzer (the proposed approach can be used with any morphological analyzer of Arabic) and obtained encouraging results, with an F-measure of more than 80%.
