Bi-directional Maximal Matching Algorithm to Segment Khmer Words in Sentence

  • Mao, Makara (Dept. of Software Convergence, Soonchunhyang University)
  • Peng, Sony (Dept. of Software Convergence, Soonchunhyang University)
  • Yang, Yixuan (Dept. of Software Convergence, Soonchunhyang University)
  • Park, Doo-Soon (Dept. of Computer Software Engineering, Soonchunhyang University)
  • Received : 2021.09.09
  • Accepted : 2021.11.29
  • Published : 2022.08.31

Abstract

The Khmer script, the official script of Cambodia, is written from left to right without spaces between words; it is complex and requires further study. Without clear standard guidelines, spaces in Khmer are used inconsistently and informally to separate words in sentences. A segmentation method should therefore be developed alongside future Khmer natural language processing (NLP) work to define appropriate rules for Khmer sentences. NLP, with its capacity for large-scale language data analysis, is needed in this scenario. One of the essential tasks in Khmer language processing is splitting a sentence into a series of words and counting the words used in it. Currently, Microsoft Word cannot count Khmer words correctly. To address these limitations, this study presents a library that segments Khmer phrases using the bi-directional maximal matching (BiMM) method. The BiMM algorithm combines forward maximal matching (FMM) and backward maximal matching (BMM) to improve word segmentation accuracy. A trie (also known as a digital tree or prefix tree) enhances the segmentation procedure by looking up the child nodes of each parent node. BiMM is more accurate than FMM or BMM used independently; moreover, the proposed approach improves the dictionary structure and reduces the number of errors. In experiments with 94,807 Khmer words, the proposed method reduced the error by 8.57% compared with the FMM and BMM algorithms.
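The combination of FMM, BMM, and a trie described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' library. The names `Trie`, `fmm`, `bmm`, and `bimm` are hypothetical, and the tie-breaking rule (prefer fewer tokens, then fewer single-character tokens, then the backward result) is a common heuristic assumed here, since the abstract does not state how disagreements between FMM and BMM are resolved. A toy Latin-letter dictionary stands in for a Khmer lexicon.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to its child node
        self.is_word = False  # True if the path from the root spells a dictionary word


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def longest_match(self, text, start):
        """Return the length of the longest dictionary word starting at text[start], or 0."""
        node, best = self.root, 0
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.is_word:
                best = i - start + 1
        return best


def fmm(text, trie):
    """Forward maximal matching: greedily take the longest word from the left."""
    tokens, i = [], 0
    while i < len(text):
        # Assumption: an unknown character becomes a one-character token.
        n = trie.longest_match(text, i) or 1
        tokens.append(text[i:i + n])
        i += n
    return tokens


def bmm(text, reversed_trie):
    """Backward maximal matching: run FMM on the reversed text against a trie
    of reversed words, then restore the original order."""
    rev_tokens = fmm(text[::-1], reversed_trie)
    return [tok[::-1] for tok in reversed(rev_tokens)]


def bimm(text, words):
    """Bi-directional maximal matching with an assumed tie-breaking heuristic."""
    words = list(words)
    forward = fmm(text, Trie(words))
    backward = bmm(text, Trie(w[::-1] for w in words))
    if forward == backward:
        return forward

    def cost(seg):
        # Assumed heuristic: fewer tokens first, then fewer single-character tokens.
        return (len(seg), sum(len(tok) == 1 for tok in seg))

    return forward if cost(forward) < cost(backward) else backward


if __name__ == "__main__":
    toy_dict = ["ab", "abc", "cd", "d"]  # stand-in for a Khmer word list
    print(fmm("abcd", Trie(toy_dict)))                      # forward result
    print(bmm("abcd", Trie(w[::-1] for w in toy_dict)))     # backward result
    print(len(bimm("abcd", toy_dict)))                      # word count after segmentation
```

In this toy case the two directions disagree (`abc` + `d` versus `ab` + `cd`), and the assumed heuristic prefers the segmentation with no single-character leftovers. Counting words, as the abstract mentions, then reduces to taking the length of the returned token list. Real Khmer text would additionally need care with combining marks and character clusters, which this sketch ignores.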

Acknowledgement

This research was supported by the National Research Foundation of Korea (No. NRF-2020RIA2B5B01002134) and the BK21 FOUR (Fostering Outstanding Universities for Research; No. 5199990914048).
