http://dx.doi.org/10.3745/KTSDE.2019.8.11.441

Automatic Word Spacing of the Korean Sentences by Using End-to-End Deep Neural Network  

Lee, Hyun Young (Department of Computer Engineering, Kookmin University)
Kang, Seung Shik (School of Software, Kookmin University)
Publication Information
KIPS Transactions on Software and Data Engineering, Vol.8, No.11, 2019, pp.441-448
Abstract
Previous research on automatic word spacing of Korean sentences corrected spacing errors by inserting blanks at word boundaries, using n-gram based statistical techniques or a morphological analyzer. In this paper, we propose an end-to-end automatic word spacing method using a deep neural network. The automatic word spacing problem can be defined as a tag classification problem at the syllable level rather than the word level. For contextual representation between syllables, a Bi-LSTM encodes the dependency relationships between syllables into fixed-length vectors in a continuous vector space, using forward and backward LSTM cells. To perform automatic word spacing of Korean sentences, the fixed-length contextual vector produced by the Bi-LSTM is classified into an auto-spacing tag (B or I), and a blank is inserted in front of each B tag. For the tag classification step, we compose three types of classification networks: a feedforward neural network, a neural network language model, and a linear-chain CRF. To compare our models, we measure word-spacing performance with each of the three classification networks; among them, the linear-chain CRF shows better performance than the other models. We used the KCC150 corpus as training and testing data.
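The B/I tagging scheme described in the abstract can be illustrated with a small sketch, independent of any particular neural model: B marks the first syllable of a word and I marks every following syllable, so training tags can be derived from a correctly spaced sentence, and a spaced sentence can be reconstructed from predicted tags by inserting a blank in front of each B tag. The function names below are illustrative, not from the paper.

```python
def sentence_to_tags(spaced_sentence):
    """Derive syllable-level B/I tags from a correctly spaced sentence.

    B marks the first syllable of each word (Eojeol); I marks the rest.
    Returns the syllable sequence (spaces removed) and its tag sequence.
    """
    syllables, tags = [], []
    for word in spaced_sentence.split():
        for i, syllable in enumerate(word):
            syllables.append(syllable)
            tags.append("B" if i == 0 else "I")
    return syllables, tags


def tags_to_sentence(syllables, tags):
    """Reconstruct a spaced sentence from syllables and predicted tags:
    insert a blank in front of each B tag (except at sentence start)."""
    out = []
    for syllable, tag in zip(syllables, tags):
        if tag == "B" and out:
            out.append(" ")
        out.append(syllable)
    return "".join(out)


# Round trip on a short Korean example ("I go to school").
syls, tags = sentence_to_tags("나는 학교에 간다")
assert tags == ["B", "I", "B", "I", "I", "B", "I"]
assert tags_to_sentence(syls, tags) == "나는 학교에 간다"
```

Because every syllable receives exactly one tag, the classification networks compared in the paper only differ in how they score the tag for each contextual Bi-LSTM vector; the decoding step above is shared.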
Keywords
Syllable Embedding; Bi-LSTM; Feedforward Neural Network; Neural Network Language Model; Linear-Chain CRF;
Citations & Related Records
Times Cited by KSCI: 1
1 K. S. Shim, "Automatic Word Spacing using Raw Corpus and a Morphological Analyzer," Journal of KIISE, Vol.42, No.1, pp.68-75, 2015.   DOI
2 K. S. Kim, H. J. Lee, and S. J. Lee, "Three-stage Word-spacing System for Continuous Syllable Sentences in Korean," Journal of KISS(B): Software and Applications, Vol.25, No.12, pp.1838-1844, 1998.
3 S. S. Kang, "Eojeol-block Bidirectional Algorithm for Automatic Word Spacing of Hangul Sentences," Journal of KISS: Software and Applications, Vol.27, No.4, pp.441-447, 2000.
4 C. K. Lee, "Structural SVM-based Korean Word Spacing using Spacing Information Input by Users," Journal of KIISE: Computing Practices and Letters, Vol.20, No.5, pp.301-305, 2014.
5 H. S. Hwang and C. K. Lee, "Automatic Korean Word Spacing using Deep Learning," in Korea Computer Congress of KIISE, Jeju, South Korea, 2016, pp.738-740.
6 T. S. Lee and S. S. Kang, "LSTM Based Sequence-to-sequence Model for Korean Automatic Word-spacing," Smart Media Journal, Vol.7, No.4, pp.17-23, 2018.
7 Heewon Jeon, "KoSpacing: Automatic Korean Word Spacing," GitHub Repository, https://github.com/haven-jeon/PyKoSpacing, http://freesearch.pe.kr/archives/4759, 2018.
8 T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed Representations of Words and Phrases and Their Compositionality," in Advances in Neural Information Processing Systems, Lake Tahoe, the United States, 2013, pp.3111-3119.
9 T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient Estimation of Word Representations in Vector Space," arXiv Preprint arXiv:1301.3781, 2013.
10 P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching Word Vectors with Subword Information," Transactions of the Association for Computational Linguistics, Vol.5, pp.135-146, 2017.   DOI
11 J. Pennington, R. Socher, and C. Manning, "Glove: Global Vectors for Word Representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 2014, pp.1532-1543.
12 T. Mikolov, M. Karafiat, L. Burget, J. Cernocky and S. Khudanpur, "Recurrent Neural Network Based Language Model," in Eleventh Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, 2010, pp.1045-1048.
13 S. W. Kim and S. P. Choi, "Research on Joint Models for Korean Word Spacing and POS (Part-Of-Speech) Tagging Based on Bidirectional LSTM-CRF," Journal of KIISE, Vol.45, No.8, pp.792-800, Aug. 2018.   DOI
14 Z. H. Huang, W. Xu, and K. Yu, "Bidirectional LSTM-CRF Models for Sequence Tagging," arXiv Preprint arXiv:1508.01991, 2015.
15 A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet Classification with Deep Convolutional Neural Networks," in Advances in Neural Information Processing Systems, Lake Tahoe, the United States, 2012, pp.1097-1105.
16 R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu, "Exploring the Limits of Language Modeling," arXiv Preprint arXiv:1602.02410, 2016.
17 T. Mikolov, S. Kombrink, L. Burget, J. Cernocky, and S. Khudanpur, "Extensions of Recurrent Neural Network Language Model," in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, 2011, pp.5528-5531.
18 Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, "A Neural Probabilistic Language Model," Journal of Machine Learning Research, Vol.3, pp.1137-1155, 2003.
19 M. Sundermeyer, R. Schluter, and H. Ney, "LSTM Neural Networks for Language Modeling," in Thirteenth Annual Conference of the International Speech Communication Association, Portland, OR, USA, 2012, pp.194-197.