
Korean speech recognition using deep learning

딥러닝 모형을 사용한 한국어 음성인식

  • Lee, Suji (Department of Statistics, Seoul National University) ;
  • Han, Seokjin (Department of Statistics, Seoul National University) ;
  • Park, Sewon (Department of Statistics, Seoul National University) ;
  • Lee, Kyeongwon (Department of Statistics, Seoul National University) ;
  • Lee, Jaeyong (Department of Statistics, Seoul National University)
  • Received : 2018.12.21
  • Accepted : 2019.02.14
  • Published : 2019.04.30

Abstract

In this paper, we propose an end-to-end deep learning model that combines a Bayesian neural network with Korean speech recognition. In the past, Korean speech recognition was a complicated task because its many intermediate steps involved an excessive number of parameters and required expert knowledge of the Korean language. Fortunately, Korean speech recognition has become manageable with recent breakthroughs in end-to-end models, which decode mel-frequency cepstral coefficients directly into text without any intermediate processes. In particular, models based on the connectionist temporal classification (CTC) loss and on attention are end-to-end models. In addition, we combine a Bayesian neural network with the end-to-end model and obtain Monte Carlo estimates. Finally, we carry out our experiments on the "WorimalSam" online dictionary dataset and obtain a 4.58% label error rate (with a 26.4% word error rate), an improvement over the Google and Naver APIs.
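The CTC loss mentioned above rests on a simple decoding rule: collapse consecutive repeated labels, then delete blank symbols. A minimal sketch of that collapse step (the jamo example string is purely illustrative):

```python
def ctc_collapse(path, blank="-"):
    """Collapse a CTC path: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != blank:  # keep only the first of each run, skip blanks
            out.append(sym)
        prev = sym
    return "".join(out)

# Repeats separated by a blank survive as two distinct symbols.
print(ctc_collapse("ㅎㅎ-ㅏㅏ-ㄴ"))  # -> ㅎㅏㄴ
print(ctc_collapse("a-aa-b"))        # -> aab
```

The blank symbol is what lets CTC emit genuinely repeated characters: without a separating blank, adjacent duplicates are merged.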

In this paper, we apply end-to-end deep learning models combined with Bayesian neural networks to Korean speech recognition. As end-to-end learning models we use connectionist temporal classification (CTC), attention, and a model that combines attention with CTC; each model is based on a recurrent neural network or a convolutional neural network. In addition, during decoding we use beam search and a finite automaton to adjust the order of consonant and vowel jamo and derive the optimal string. We also apply a Bayesian neural network to each end-to-end model, obtain both ordinary point estimates and Monte Carlo estimates, and compare them with the results of the existing end-to-end models. Finally, we select the best-performing of the proposed models and compare it with commercially available application programming interfaces (APIs). Evaluated on the WorimalSam online dictionary training data, the proposed model achieves a word error rate (WER) of 26.4% and a label error rate (LER) of 4.58%, a substantial improvement over the Google API, which shows a 76% WER and a 29.88% LER.
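The WER and LER reported above are both normalized Levenshtein (edit) distances, computed over words and over individual characters (labels) respectively. A minimal sketch:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via dynamic programming."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def wer(ref, hyp):
    """Word error rate: edit distance over word sequences."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def ler(ref, hyp):
    """Label (character) error rate: edit distance over characters."""
    return edit_distance(ref, hyp) / len(ref)
```

Because Korean syllables decompose into several jamo labels each, a single wrong syllable can cost one word error but only one of several labels, which is why the LER is much lower than the WER.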



Figure 2.1. Sequence to sequence model.


Figure 2.2. Attention model (Bahdanau et al., 2014).
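The attention model of Figure 2.2 scores each encoder hidden state against the decoder state and forms a softmax-weighted context vector. A numpy sketch of Bahdanau-style additive attention, with purely illustrative dimensions and random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8                      # encoder time steps, hidden size (illustrative)
H = rng.normal(size=(T, d))      # encoder hidden states h_1..h_T
s = rng.normal(size=d)           # previous decoder state s_{t-1}

# Additive score: e_i = v^T tanh(W_a s + U_a h_i)
W_a = rng.normal(size=(d, d))
U_a = rng.normal(size=(d, d))
v = rng.normal(size=d)
e = np.tanh(s @ W_a.T + H @ U_a.T) @ v   # shape (T,)

# Softmax attention weights and the context vector fed to the decoder
alpha = np.exp(e - e.max())
alpha /= alpha.sum()
context = alpha @ H              # weighted sum of encoder states, shape (d,)
```

The weights alpha sum to one, so the context vector is a convex combination of encoder states: the decoder "attends" most to the time steps with the highest scores.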


Figure 4.1. Mel-frequency cepstral coefficients.
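The mel-frequency cepstral coefficients of Figure 4.1 are built on the mel scale, a perceptual frequency warping given by the standard formula m = 2595 log10(1 + f/700). A small sketch of the conversion and of the filter-bank center frequencies it induces (the filter count and frequency range here are illustrative, not the paper's settings):

```python
import math

def hz_to_mel(f_hz):
    """Standard mel-scale mapping; ~linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Center frequencies of a small illustrative mel filter bank: equally
# spaced in mel, hence increasingly spread out in Hz.
low, high, n_filters = hz_to_mel(0.0), hz_to_mel(8000.0), 10
centers = [mel_to_hz(low + k * (high - low) / (n_filters + 1))
           for k in range(1, n_filters + 1)]
```

By construction, 1000 Hz maps to roughly 1000 mel, and the filter centers crowd together at low frequencies where human pitch resolution is finest.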


Figure 4.2. The structure of the encoder.


Figure 4.3. A finite automaton that searches for correct Korean strings.
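The automaton of Figure 4.3 constrains decoded jamo sequences to well-formed Korean syllable order. A hypothetical, much-simplified version (not the paper's exact automaton, and with abbreviated jamo sets) that accepts only sequences of the pattern initial consonant, vowel, optional final consonant:

```python
# Simplified jamo sets (illustrative subset, not exhaustive).
CHO = set("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")   # consonants
JUNG = set("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")  # vowels

def accepts(jamo):
    """DFA over jamo: each syllable is consonant, vowel, optional final consonant."""
    state = "NEED_C"                      # start of a syllable: need a consonant
    for ch in jamo:
        if state == "NEED_C":
            if ch in CHO: state = "NEED_V"
            else: return False
        elif state == "NEED_V":
            if ch in JUNG: state = "AFTER_V"
            else: return False
        elif state == "AFTER_V":          # vowel seen: a consonant may follow
            if ch in CHO: state = "C_SEEN"
            else: return False
        else:                             # C_SEEN: final consonant OR next initial
            if ch in JUNG: state = "AFTER_V"   # it was the next syllable's initial
            elif ch in CHO: state = "NEED_V"   # previous was final; this is initial
            else: return False
    return state in ("AFTER_V", "C_SEEN")

print(accepts("ㅎㅏㄴ"))   # valid syllable 한 -> True
print(accepts("ㅏㄴ"))     # vowel cannot start a syllable -> False
```

During beam-search decoding, a constraint of this kind prunes candidate strings whose jamo order could never compose into real Hangul syllables, which is how the language model of Table 5.2 improves the raw acoustic model's output.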

Table 5.1. Performance comparison between end-to-end deep learning models


Table 5.2. Performance comparison when adding a finite automaton language model


Table 5.3. Performance comparison with commercial APIs


References

  1. Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473
  2. Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks, 5, 157-166. https://doi.org/10.1109/72.279181
  3. Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). Weight uncertainty in neural networks, arXiv preprint, arXiv:1505.05424
  4. Chan, W., Jaitly, N., Le, Q. V., and Vinyals, O. (2016). Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4960-4964. IEEE.
  5. Cho, K., van Merrienboer, B., Bahdanau, D., and Bengio, Y. (2014). On the properties of neural machine translation: Encoder-Decoder approaches. In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation.
  6. Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, arXiv preprint, arXiv:1406.1078
  7. Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling, arXiv preprint, arXiv:1412.3555
  8. Gal, Y. and Ghahramani, Z. (2016a). Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 1050-1059.
  9. Gal, Y. and Ghahramani, Z. (2016b). A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, 1019-1027.
  10. Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning, MIT Press, Cambridge.
  11. Graves, A., Fernandez, S., Gomez, F., and Schmidhuber, J. (2006). Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, 369-376. ACM.
  12. Gales, M. and Young, S. (2008). The application of hidden Markov models in speech recognition, Foundations and Trends in Signal Processing, 1, 195-304. https://doi.org/10.1561/2000000004
  13. Hochreiter, S., and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9, 1735-1780. https://doi.org/10.1162/neco.1997.9.8.1735
  14. Huang, X., Acero, A., and Hon, H. (2001). Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice hall PTR, New Jersey.
  15. Jelinek, F. (1997). Statistical Methods for Speech Recognition, MIT press, Cambridge.
  16. Kim, S., Hori, T., and Watanabe, S. (2017). Joint CTC-attention based end-to-end speech recognition using multi-task learning. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, 4835-4839. IEEE.
  17. Kingma, D. P. and Ba, J. (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  18. Kwon, O. W. and Park, J. (2003). Korean large vocabulary continuous speech recognition with morpheme-based recognition units, Speech Communication, 39, 287-300. https://doi.org/10.1016/S0167-6393(02)00031-6
  19. Luong, M.-T., Pham, H., and Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation. arXiv preprint arXiv:1508.04025