Ternary Decomposition and Dictionary Extension for Khmer Word Segmentation

  • Received : 2015.10.27
  • Accepted : 2016.05.04
  • Published : 2016.06.30


In this paper, we propose a dictionary extension and a ternary decomposition technique to improve the effectiveness of Khmer word segmentation. Most word segmentation approaches depend on a dictionary. However, the dictionary in use is not fully reliable and cannot cover all the words of the Khmer language, which leads to the problem of unknown, or out-of-vocabulary, words. Our approach extends the original dictionary with new words to make it more reliable, and uses ternary decomposition for the segmentation process. We also use the invisible space found in Khmer Unicode text (the zero-width space, U+200B) to segment our training corpus. With a segmentation algorithm based on ternary decomposition and the invisible space, we extract new words from the training text and add them to the dictionary. We then test on unannotated text using the extended word list and a segmentation algorithm that does not rely on the invisible space. Our results outperform other approaches: we achieve precision, recall, and F-measure of 88.8%, 91.8%, and 90.6%, respectively.
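
The dictionary-extension step described above can be illustrated with a minimal sketch. It assumes that the training text marks word boundaries with the zero-width space (U+200B) and that any delimited token missing from the dictionary is a candidate new word. The function names `extract_new_words` and `extend_dictionary` are hypothetical, and the sketch does not implement ternary decomposition or any filtering of candidates that the full method may apply.

```python
ZWSP = "\u200b"  # zero-width space, assumed to mark word boundaries in the training text


def extract_new_words(annotated_text, dictionary):
    """Collect tokens delimited by ZWSP or whitespace that are not in the dictionary."""
    new_words = set()
    # str.split() with no argument splits on whitespace only; U+200B is not
    # treated as whitespace by Python, so it is split on explicitly below.
    for chunk in annotated_text.split():
        for token in chunk.split(ZWSP):
            if token and token not in dictionary:
                new_words.add(token)
    return new_words


def extend_dictionary(dictionary, annotated_text):
    """Return a new word set: the original dictionary plus the extracted words."""
    return set(dictionary) | extract_new_words(annotated_text, dictionary)
```

Calling `extend_dictionary(wordlist, training_text)` returns a new set and leaves the original word list untouched, so the original and extended dictionaries can be compared on the same test text.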


