Korean automatic spacing using pretrained transformer encoder and analysis

Hwang, Taewook; Jung, Sangkeun; Roh, Yoon-Hyung

doi:10.4218/etrij.2020-0092

ETRI Journal

Volume 43, Issue 6 / Pages 1049-1057 / 2021 / 1225-6463 (pISSN) / 2233-7326 (eISSN)

Electronics and Telecommunications Research Institute (한국전자통신연구원)

Hwang, Taewook (Computer Science & Engineering, Chungnam National University);
Jung, Sangkeun (Computer Science & Engineering, Chungnam National University);
Roh, Yoon-Hyung (Language Intelligence Research Section, Electronics and Telecommunications Research Institute)

Received : 2020.03.13
Accepted : 2021.05.26
Published : 2021.12.01

https://doi.org/10.4218/etrij.2020-0092

Abstract

Automatic spacing in Korean is used to correct spacing units in a given input sentence. The demand for automatic spacing has been increasing owing to frequent incorrect spacing in recent media, such as the Internet and mobile networks. We therefore propose a transformer encoder that reads a sentence bidirectionally and can be pretrained using an out-of-task corpus. Notably, our model exhibited the highest character accuracy (98.42%) among the existing automatic spacing models for Korean. We experimentally validated the effectiveness of bidirectional encoding and pretraining for automatic spacing in Korean. Moreover, we conclude that pretraining is more important than fine-tuning and data size.
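
To make the task and the reported metric concrete, the following is a minimal sketch that assumes automatic spacing is framed as per-character binary labeling (label 1 means a space follows the character) and that character accuracy is the fraction of characters whose predicted label matches the gold label. The abstract does not spell out either definition, so both are assumptions, and the function names `to_labels`, `apply_labels`, and `character_accuracy` are hypothetical helpers rather than the authors' code.

```python
def to_labels(spaced: str) -> tuple[str, list[int]]:
    """Remove spaces from a sentence and return (characters, labels).

    Assumed labeling scheme: labels[i] == 1 means a space follows characters[i].
    """
    chars: list[str] = []
    labels: list[int] = []
    for ch in spaced.strip():
        if ch == " ":
            if labels:
                labels[-1] = 1  # mark the preceding character
        else:
            chars.append(ch)
            labels.append(0)
    return "".join(chars), labels


def apply_labels(chars: str, labels: list[int]) -> str:
    """Rebuild a spaced sentence from characters and per-character labels."""
    out: list[str] = []
    for ch, label in zip(chars, labels):
        out.append(ch)
        if label == 1:
            out.append(" ")
    return "".join(out).rstrip()


def character_accuracy(gold: list[int], pred: list[int]) -> float:
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Illustrative sentence (not taken from the paper's data).
chars, gold = to_labels("나는 학교에 간다")
_, pred = to_labels("나는 학교 에 간다")  # one spurious space after "교"
score = character_accuracy(gold, pred)  # 6 of 7 labels agree
```

Under this framing, a model such as the transformer encoder described above would produce one label per input character, and `apply_labels` would turn those labels back into a spaced sentence.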

Acknowledgement

This work was supported by two grants from the Institute for Information and Communications Technology Promotion (IITP) funded by the Korea government (MSIT) (Nos. 2020-0-01441 and 2019-0-00004: Artificial Intelligence Convergence Research Center (Chungnam National University), and Development of Semi-Supervised Learning Language Intelligence Technology and Korean Tutoring Service for Foreigners) and by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2019R1F1A1060601).

