End-to-end speech recognition models using limited training data

제한된 학습 데이터를 사용하는 End-to-End 음성 인식 모델

  • Kim, June-Woo (Department of Artificial Intelligence, Kyungpook National University) ;
  • Jung, Ho-Young (Department of Artificial Intelligence, Kyungpook National University)
  • 김준우 (경북대학교 인공지능학과) ;
  • 정호영 (경북대학교 인공지능학과)
  • Received : 2020.07.31
  • Accepted : 2020.11.16
  • Published : 2020.12.31


Speech recognition is one of the areas actively commercialized using deep learning and machine learning techniques. However, the majority of speech recognition systems on the market are developed on data with limited diversity of speakers and tend to perform well on typical adult speakers only. This is because most of the speech recognition models are generally learned using a speech database obtained from adult males and females. This tends to cause problems in recognizing the speech of the elderly, children and people with dialects well. To solve these problems, it may be necessary to retain big database or to collect a data for applying a speaker adaptation. However, this paper proposes that a new end-to-end speech recognition method consists of an acoustic augmented recurrent encoder and a transformer decoder with linguistic prediction. The proposed method can bring about the reliable performance of acoustic and language models in limited data conditions. The proposed method was evaluated to recognize Korean elderly and children speech with limited amount of training data and showed the better performance compared of a conventional method.

음성 인식은 딥러닝 및 머신러닝 분야에서 활발히 상용화 되고 있는 분야 중 하나이다. 그러나, 현재 개발되고 있는 음성 인식 시스템은 대부분 성인 남녀를 대상으로 인식이 잘 되는 실정이다. 이것은 음성 인식 모델이 대부분 성인 남녀 음성 데이터베이스를 학습하여 구축된 모델이기 때문이다. 따라서, 노인, 어린이 및 사투리를 갖는 화자의 음성을 인식하는데 문제를 일으키는 경향이 있다. 노인과 어린이의 음성을 잘 인식하기 위해서는 빅데이터를 구축하는 방법과 성인 대상 음성 인식 엔진을 노인 및 어린이 데이터로 적응하는 방법 등이 있을 수 있지만, 본 논문에서는 음향적 데이터 증강에 기반한 재귀적 인코더와 언어적 예측이 가능한 transformer 디코더로 구성된 새로운 end-to-end 모델을 제안한다. 제한된 데이터셋으로 구성된 한국어 노인 및 어린이 음성 인식을 통해 제안된 방법의 성능을 평가한다.



  1. Bahdanau, D., Cho, K., & Bengio, Y. (2015, January). Neural machine translation by jointly learning to align and translate. Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015. San Diego, CA.
  2. Chan, W., Jaitly, N., Le, Q., & Vinyals, O. (2016, March). Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4960-4964). Shanghai, China.
  3. Chorowski, J. K., Bahdanau, D., Serdyuk, D., Cho, K., & Bengio, Y. (2015). Attention-based models for speech recognition. Advances in Neural Information Processing Systems, 2015, 577-585.
  4. Chorowski, J., Bahdanau, D., Cho, K., & Bengio, Y. (2014, December). End-to-end continuous speech recognition using attention-based recurrent NN: First results. Proceedings of the NIPS 2014 Workshop on Deep Learning. Montreal, Canada.
  5. Graves, A., Fernandez, S., Gomez, F., & Schmidhuber, J. (2006, June). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. Proceedings of the 23rd International Conference on Machine Learning (pp. 369-376). Pittsburgh, PA.
  6. Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A. R., Jaitly, N., Senior, A., ... Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6), 82-97.
  7. Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the 32nd International Conference on Machine Learning, PMLR (pp. 448-456). Mountain View, CA.
  8. Levy, O., & Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. In Z. Ghahramani., M. Welling, C. Cortes, N. Lawrence & K. Q. Weinberger (Eds.), Advances in neural information processing systems (pp. 2177-2185). Red Hook, NY: Curran Associates.
  9. Luong, M. T., Pham, H., & Manning, C. D. (2015, September). Effective approaches to attention-based neural machine translation. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (pp. 1412-1421). Lisbon, Portugal.
  10. Mikolov, T., Kombrink, S., Burget, L., Cernocky, J., & Khudanpur, S. (2011, May). Extensions of recurrent neural network language model. Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5528-5531). Prague, Czech Republic.
  11. Nair, V., & Hinton, G. E. (2010, January). Rectified linear units improve restricted boltzmann machines. Proceedings of the 27th International Conference on International Conference on Machine Learning (ICML) (pp. 807-814). Haifa, Israel.
  12. Paul, D. B., & Baker, J. (1992, February). The design for the wall street journal-based CSR corpus. Proceedings of the Speech and Natural Language: Proceedings of a Workshop Held at Harriman. New York, NY.
  13. Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., ... Silovsky, J. (2011, December). The Kaldi speech recognition toolkit. Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. Big Island, HI.
  14. Ranzato, M. A., Huang, F. J., Boureau, Y. L., & LeCun, Y. (2007, June). Unsupervised learning of invariant feature hierarchies with applications to object recognition. Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition (pp. 1-8). Minneapolis, MN.
  15. Sainath, T. N., Vinyals, O., Senior, A., & Sak, H. (2015, April). Convolutional, long short-term memory, fully connected deep neural networks. Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4580-4584). Brisbane, Australia.
  16. Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673-2681.
  17. Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv. Retrieved from
  18. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929-1958.
  19. Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Z. Ghahramani., M. Welling, C. Cortes, N. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems (pp. 3104-3112). Red Hook, NY: Curran Associates.
  20. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In U. von Luxburg, A. Bengio, R. Fergus, R. Garnett, I. Guyon, H. Wallach, & S. V. N. Vishwanathan (Eds.), Advances in neural information processing systems (pp. 5998-6008). Red Hook, NY: Curran Associates.
  21. Watanabe, S., Hori, T., Karita, S., Hayashi, T., Nishitoba, J., Unno, Y., Soplin, N. E. Y., ... Renduchintala, A. (2018). Espnet: End-to-end speech processing toolkit. arXiv. Retrieved from