Manchu Script Letters Dataset Creation and Labeling

  • Aaron Daniel Snowberger (Department of Information and Communication Engineering, Hanbat National University)
  • Choong Ho Lee (Department of Information and Communication Engineering, Hanbat National University)
  • Received: 2023.06.07
  • Reviewed: 2023.09.25
  • Published: 2024.03.31

Abstract

The Manchu language holds historical significance, but no complete dataset of Manchu script letters is currently available for training optical character recognition (OCR) machine-learning models. This paper therefore describes the process of creating a robust dataset of extracted Manchu script letters. Rather than segmenting letters automatically based on whitespace or the thickness of the central word stem, an image of the Manchu script was inspected manually, and one copy of the desired letter was selected as a region of interest. This region of interest was then used as a template to match all other occurrences of the same letter within the Manchu script image. Although the dataset in this study contains only 4,000 images of five Manchu script letters, these letters were collected from twenty-eight writing styles. A full dataset of Manchu letters is expected to be obtainable through the same process. The collected dataset was normalized and used to train a simple convolutional neural network to verify its effectiveness.
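
The template-matching step described in the abstract can be sketched roughly as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the sum-of-squared-differences score, the `max_ssd` threshold, the greedy overlap suppression, the toy `page` image, and all function names are hypothetical choices made for this example. A practical pipeline would more likely rely on an image-processing library.

```python
from typing import List, Tuple

Image = List[List[int]]          # grayscale pixels, row-major
Hit = Tuple[int, int, int]       # (row, col, score); lower score = better match


def match_template(image: Image, template: Image, max_ssd: int = 0) -> List[Hit]:
    """Slide `template` over `image` and return every window whose
    sum of squared differences (SSD) is at most `max_ssd`."""
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    hits: List[Hit] = []
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            ssd = sum(
                (image[r + dr][c + dc] - template[dr][dc]) ** 2
                for dr in range(th)
                for dc in range(tw)
            )
            if ssd <= max_ssd:
                hits.append((r, c, ssd))
    return hits


def suppress_overlaps(hits: List[Hit], th: int, tw: int) -> List[Hit]:
    """Greedy suppression: visit hits from best to worst score and keep a hit
    only if its window does not overlap any window already kept."""
    kept: List[Hit] = []
    for r, c, ssd in sorted(hits, key=lambda h: h[2]):
        # Two equal-sized windows are disjoint if they are at least one
        # template height apart vertically or one width apart horizontally.
        if all(abs(r - kr) >= th or abs(c - kc) >= tw for kr, kc, _ in kept):
            kept.append((r, c, ssd))
    return sorted(kept)


# Toy "page": the same 2x2 glyph appears at (0, 0) and at (2, 3).
page = [
    [1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 1, 0],
]

# Manually selected region of interest: the first copy of the glyph.
roi = [row[0:2] for row in page[0:2]]

# A loose threshold admits near-matches; suppression keeps the best ones.
candidates = match_template(page, roi, max_ssd=2)
letters = suppress_overlaps(candidates, th=len(roi), tw=len(roi[0]))
print(letters)
```

A loose threshold yields several overlapping candidates around each true occurrence, which is why a suppression pass is applied before the matched regions are cropped and saved as individual letter images.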
