Korean automatic spacing using pretrained transformer encoder and analysis

  • Hwang, Taewook (Computer Science & Engineering, Chungnam National University)
  • Jung, Sangkeun (Computer Science & Engineering, Chungnam National University)
  • Roh, Yoon-Hyung (Language Intelligence Research Section, Electronics and Telecommunications Research Institute)
  • Received : 2020.03.13
  • Reviewed : 2021.05.26
  • Published : 2021.12.01

Abstract

Automatic spacing in Korean corrects the spacing units of a given input sentence. Demand for automatic spacing has been increasing because text in recent media, such as the Internet and mobile networks, is frequently spaced incorrectly. We therefore propose a transformer encoder that reads a sentence bidirectionally and can be pretrained on an out-of-task corpus. Our model achieved the highest character accuracy (98.42%) among existing automatic spacing models for Korean. We experimentally validated the effectiveness of bidirectional encoding and pretraining for automatic spacing in Korean, and we conclude that pretraining is more important than fine-tuning and data size.
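
To make the task and the reported metric concrete, the following is a minimal sketch of one common way to frame automatic spacing: each non-space character receives a binary label saying whether a space follows it, and character accuracy is the fraction of labels predicted correctly. The abstract does not state the label scheme or the exact metric definition used by the authors, so both are assumptions here, and the function names `to_labels`, `apply_labels`, and `character_accuracy` are hypothetical.

```python
from typing import List, Tuple


def to_labels(spaced: str) -> Tuple[str, List[int]]:
    """Strip spaces from a sentence and return (characters, labels).

    Assumed scheme: label 1 means "a space follows this character",
    label 0 means "no space follows".
    """
    chars: List[str] = []
    labels: List[int] = []
    for ch in spaced:
        if ch == " ":
            if labels:  # ignore leading spaces
                labels[-1] = 1
        else:
            chars.append(ch)
            labels.append(0)
    return "".join(chars), labels


def apply_labels(chars: str, labels: List[int]) -> str:
    """Rebuild a spaced sentence from characters and per-character labels."""
    out: List[str] = []
    for ch, lab in zip(chars, labels):
        out.append(ch)
        if lab == 1:
            out.append(" ")
    return "".join(out).rstrip()


def character_accuracy(gold: List[int], pred: List[int]) -> float:
    """Fraction of characters whose spacing label matches the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 1.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


if __name__ == "__main__":
    chars, gold = to_labels("나는 학교에 간다")
    pred = [0, 1, 0, 0, 0, 0, 0]  # hypothetical model output missing one space
    print(apply_labels(chars, pred))
    print(character_accuracy(gold, pred))
```

Under this framing, a model such as the proposed transformer encoder would act as a per-character classifier, and its predicted labels would be passed to something like `apply_labels` to produce the corrected sentence.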

Funding Information

This work was supported by two grants from the Institute for Information and Communications Technology Promotion (IITP) funded by the Korea government (MSIT) (Nos. 2020-0-01441 and 2019-0-00004; Artificial Intelligence Convergence Research Center (Chungnam National University), Development of Semi-Supervised Learning Language Intelligence Technology and Korean Tutoring Service for Foreigners) and by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2019R1F1A1060601).
