http://dx.doi.org/10.5351/KJAS.2019.32.2.213

Korean speech recognition using deep learning  

Lee, Suji (Department of Statistics, Seoul National University)
Han, Seokjin (Department of Statistics, Seoul National University)
Park, Sewon (Department of Statistics, Seoul National University)
Lee, Kyeongwon (Department of Statistics, Seoul National University)
Lee, Jaeyong (Department of Statistics, Seoul National University)
Publication Information
The Korean Journal of Applied Statistics / v.32, no.2, 2019, pp. 213-227
Abstract
In this paper, we propose an end-to-end deep learning model that combines a Bayesian neural network with Korean speech recognition. In the past, Korean speech recognition was a complicated task due to the excessive number of parameters in its many intermediate steps and the need for expert knowledge of Korean. Fortunately, Korean speech recognition has become manageable with the aid of recent breakthroughs in "end-to-end" models. An end-to-end model decodes mel-frequency cepstral coefficients directly into text without any intermediate processing; Connectionist Temporal Classification (CTC) loss and attention-based models are two such approaches. In addition, we incorporate a Bayesian neural network into the end-to-end model and obtain Monte Carlo estimates. Finally, we carry out experiments on the "WorimalSam" online dictionary dataset. We obtain a 4.58% word error rate, an improvement over the Google and Naver APIs.
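The 4.58% figure above is the word error rate (WER), the standard evaluation metric for speech recognition: the minimum number of word-level substitutions, insertions, and deletions needed to turn the hypothesis transcript into the reference, divided by the number of reference words. As an illustrative sketch (not the authors' evaluation code), WER can be computed with a word-level Levenshtein distance:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table: d[i][j] is the edit distance between
    # the first i reference words and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("a b c d", "a x c d")` is 0.25 (one substitution out of four reference words).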
Keywords
Korean speech recognition; end to end deep learning; Connectionist temporal classification; Attention; Bayesian deep learning;
  • Reference
1 Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473
2 Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks, 5, 157-166.
3 Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). Weight uncertainty in neural networks, arXiv preprint, arXiv:1505.05424
4 Chan, W., Jaitly, N., Le, Q. V., and Vinyals, O. (2015). Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, 4960-4964. IEEE.
5 Cho, K., van Merrienboer, B., Bahdanau, D., and Bengio, Y. (2014). On the properties of neural machine translation: Encoder-Decoder approaches. In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation.
6 Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, arXiv preprint, arXiv:1406.1078
7 Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling, arXiv preprint, arXiv:1412.3555
8 Gal, Y. and Ghahramani, Z. (2016a). Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 1050-1059.
9 Gal, Y. and Ghahramani, Z. (2016b). A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, 1019-1027.
10 Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning (Vol. 1), MIT Press, Cambridge.
11 Graves, A., Fernandez, S., Gomez, F., and Schmidhuber, J. (2006). Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, 369-376. ACM.
12 Gales, M. and Young, S. (2008). The application of hidden Markov models in speech recognition, Foundations and Trends in Signal Processing, 1, 195-304.
13 Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory, Neural Computation, 9, 1735-1780.
14 Huang, X., Acero, A., and Hon, H. (2001). Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice Hall PTR, New Jersey.
15 Jelinek, F. (1997). Statistical Methods for Speech Recognition, MIT press, Cambridge.
16 Kim, S., Hori, T., and Watanabe, S. (2017). Joint CTC-attention based end-to-end speech recognition using multi-task learning. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, 4835-4839. IEEE.
17 Kingma, D. P. and Ba, J. (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
18 Luong, M.-T., Pham, H., and Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation. arXiv preprint arXiv:1508.04025
19 Kwon, O. W. and Park, J. (2003). Korean large vocabulary continuous speech recognition with morpheme-based recognition units, Speech Communication, 39, 287-300.