http://dx.doi.org/10.13064/KSSS.2020.12.4.063

End-to-end speech recognition models using limited training data  

Kim, June-Woo (Department of Artificial Intelligence, Kyungpook National University)
Jung, Ho-Young (Department of Artificial Intelligence, Kyungpook National University)
Publication Information
Phonetics and Speech Sciences, v.12, no.4, 2020, pp. 63-71
Abstract
Speech recognition is one of the areas being actively commercialized with deep learning and machine learning techniques. However, most speech recognition systems on the market are built on data with limited speaker diversity and tend to perform well only on typical adult speakers, because the underlying models are usually trained on speech databases collected from adult males and females. As a result, they often fail to recognize the speech of the elderly, children, and speakers with dialects. Addressing this problem would normally require a large database or additional data collection for speaker adaptation. Instead, this paper proposes a new end-to-end speech recognition method consisting of an acoustic augmented recurrent encoder and a transformer decoder with linguistic prediction. The proposed method yields reliable acoustic- and language-model performance under limited-data conditions. Evaluated on Korean elderly and children's speech with a limited amount of training data, the proposed method outperformed a conventional method.
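As a rough illustration of the architecture described above, the sketch below pairs a bidirectional recurrent (LSTM) acoustic encoder with a Transformer decoder that cross-attends to the encoder states. This is a minimal sketch assuming PyTorch; all class names, layer sizes, and the 80-dimensional filterbank input are illustrative assumptions, and the paper's specific acoustic augmentation and linguistic-prediction components are not reproduced here.

```python
# Minimal sketch of a recurrent-encoder / Transformer-decoder ASR model.
# Assumptions: PyTorch; hyperparameters and names are illustrative only.
import torch
import torch.nn as nn

class RecurrentEncoder(nn.Module):
    """Bidirectional LSTM encoder over acoustic feature frames."""
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, hidden)  # fold the two directions

    def forward(self, x):          # x: (batch, time, feat_dim)
        out, _ = self.lstm(x)
        return self.proj(out)      # (batch, time, hidden)

class Seq2SeqASR(nn.Module):
    """Recurrent acoustic encoder + Transformer decoder over text tokens."""
    def __init__(self, vocab=100, hidden=256, heads=4, layers=2):
        super().__init__()
        self.encoder = RecurrentEncoder(hidden=hidden)
        self.embed = nn.Embedding(vocab, hidden)
        dec_layer = nn.TransformerDecoderLayer(d_model=hidden, nhead=heads,
                                               batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=layers)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, feats, tokens):
        memory = self.encoder(feats)                    # acoustic states
        tgt = self.embed(tokens)
        # Causal mask so each output position sees only earlier tokens
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        dec = self.decoder(tgt, memory, tgt_mask=mask)  # cross-attends to audio
        return self.out(dec)                            # (batch, tgt_len, vocab)

model = Seq2SeqASR()
feats = torch.randn(2, 50, 80)          # 2 utterances, 50 frames, 80-dim fbank
tokens = torch.randint(0, 100, (2, 7))  # 2 partial transcripts, 7 tokens each
logits = model(feats, tokens)
print(logits.shape)                     # torch.Size([2, 7, 100])
```

In such a design, the recurrent encoder summarizes variable-length audio while the Transformer decoder supplies the linguistic modeling; training would minimize cross-entropy between the logits and the shifted target transcript.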
Keywords
speech recognition; end-to-end model; small-data speech recognition
References
1 Bahdanau, D., Cho, K., & Bengio, Y. (2015, May). Neural machine translation by jointly learning to align and translate. Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015. San Diego, CA.
2 Chan, W., Jaitly, N., Le, Q., & Vinyals, O. (2016, March). Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4960-4964). Shanghai, China.
3 Chorowski, J. K., Bahdanau, D., Serdyuk, D., Cho, K., & Bengio, Y. (2015). Attention-based models for speech recognition. Advances in Neural Information Processing Systems, 2015, 577-585.
4 Chorowski, J., Bahdanau, D., Cho, K., & Bengio, Y. (2014, December). End-to-end continuous speech recognition using attention-based recurrent NN: First results. Proceedings of the NIPS 2014 Workshop on Deep Learning. Montreal, Canada.
5 Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006, June). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. Proceedings of the 23rd International Conference on Machine Learning (pp. 369-376). Pittsburgh, PA.
6 Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A. R., Jaitly, N., Senior, A., ... Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6), 82-97.
7 Ioffe, S., & Szegedy, C. (2015, July). Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the 32nd International Conference on Machine Learning, PMLR (pp. 448-456). Lille, France.
8 Mikolov, T., Kombrink, S., Burget, L., Cernocky, J., & Khudanpur, S. (2011, May). Extensions of recurrent neural network language model. Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5528-5531). Prague, Czech Republic.
9 Levy, O., & Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems (pp. 2177-2185). Red Hook, NY: Curran Associates.
10 Luong, M. T., Pham, H., & Manning, C. D. (2015, September). Effective approaches to attention-based neural machine translation. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (pp. 1412-1421). Lisbon, Portugal.
11 Nair, V., & Hinton, G. E. (2010, June). Rectified linear units improve restricted Boltzmann machines. Proceedings of the 27th International Conference on Machine Learning (ICML) (pp. 807-814). Haifa, Israel.
12 Paul, D. B., & Baker, J. (1992, February). The design for the Wall Street Journal-based CSR corpus. Speech and Natural Language: Proceedings of a Workshop. Harriman, NY.
13 Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., ... Silovsky, J. (2011, December). The Kaldi speech recognition toolkit. Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. Big Island, HI.
14 Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems (pp. 3104-3112). Red Hook, NY: Curran Associates.
15 Ranzato, M. A., Huang, F. J., Boureau, Y. L., & LeCun, Y. (2007, June). Unsupervised learning of invariant feature hierarchies with applications to object recognition. Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition (pp. 1-8). Minneapolis, MN.
16 Sainath, T. N., Vinyals, O., Senior, A., & Sak, H. (2015, April). Convolutional, long short-term memory, fully connected deep neural networks. Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4580-4584). Brisbane, Australia.
17 Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673-2681.
18 Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv. Retrieved from https://arxiv.org/abs/1409.1556
19 Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929-1958.
20 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In U. von Luxburg, S. Bengio, R. Fergus, R. Garnett, I. Guyon, H. Wallach, & S. V. N. Vishwanathan (Eds.), Advances in neural information processing systems (pp. 5998-6008). Red Hook, NY: Curran Associates.
21 Watanabe, S., Hori, T., Karita, S., Hayashi, T., Nishitoba, J., Unno, Y., Soplin, N. E. Y., ... Renduchintala, A. (2018). ESPnet: End-to-end speech processing toolkit. arXiv. Retrieved from https://arxiv.org/abs/1804.00015