http://dx.doi.org/10.7776/ASK.2021.40.5.488

A Korean speech recognition based on conformer  

Koo, Myoung-Wan (Department of Computer Science and Engineering, Sogang University)
Abstract
We propose a speech recognition system based on the conformer. The conformer is a convolution-augmented transformer that combines a transformer model, which captures global information, with a Convolutional Neural Network (CNN), which exploits local features effectively. The baseline system is a transformer-based speech recognizer with a Long Short-Term Memory (LSTM)-based language model. The proposed system replaces the transformer with a conformer and uses a transformer-based language model. When evaluated on the Electronics and Telecommunications Research Institute (ETRI) speech corpus from AI-Hub, the proposed system yields a Character Error Rate (CER) of 5.7 %, while the baseline system yields a CER of 11.8 %. Even when the evaluation is extended to another AI-Hub domain, the NHNdiguest speech corpus, the proposed system remains robust across both domains. These experiments demonstrate the validity of the proposed system.
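The abstract's description of the conformer (self-attention for global context combined with a convolution module for local features) corresponds to a compact block structure. Below is a minimal PyTorch sketch of one such block, assuming the layout of the original conformer design (half-step feed-forward, self-attention, depthwise convolution, half-step feed-forward, each with a residual connection); it is an illustration under those assumptions, not the system evaluated in this paper, and all names and hyperparameters (d_model, n_heads, kernel_size) are placeholders.

```python
# Minimal conformer block sketch (PyTorch). Names and sizes are
# illustrative assumptions, not this paper's implementation.
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """Convolution module: pointwise conv + GLU, depthwise conv,
    BatchNorm, Swish, pointwise conv; models local features."""
    def __init__(self, d_model, kernel_size=31):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pw1 = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)
        self.glu = nn.GLU(dim=1)
        self.dw = nn.Conv1d(d_model, d_model, kernel_size,
                            padding=kernel_size // 2, groups=d_model)
        self.bn = nn.BatchNorm1d(d_model)
        self.act = nn.SiLU()  # Swish activation
        self.pw2 = nn.Conv1d(d_model, d_model, kernel_size=1)

    def forward(self, x):                      # x: (batch, time, d_model)
        y = self.norm(x).transpose(1, 2)       # Conv1d expects (batch, channel, time)
        y = self.pw2(self.act(self.bn(self.dw(self.glu(self.pw1(y))))))
        return x + y.transpose(1, 2)           # residual connection

class ConformerBlock(nn.Module):
    """Half-step FFN -> self-attention (global) -> convolution (local)
    -> half-step FFN, each residual, then a final LayerNorm."""
    def __init__(self, d_model=256, n_heads=4, kernel_size=31, ff_mult=4):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.LayerNorm(d_model),
                                 nn.Linear(d_model, ff_mult * d_model),
                                 nn.SiLU(),
                                 nn.Linear(ff_mult * d_model, d_model))
        self.ff1, self.ff2 = ffn(), ffn()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = ConvModule(d_model, kernel_size)
        self.out_norm = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, time, d_model)
        x = x + 0.5 * self.ff1(x)              # first half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]  # global context
        x = self.conv(x)                       # local features via CNN
        x = x + 0.5 * self.ff2(x)              # second half-step feed-forward
        return self.out_norm(x)

# Usage: a batch of two 80-frame, 256-dim encoder inputs.
x = torch.randn(2, 80, 256)
print(ConformerBlock()(x).shape)               # torch.Size([2, 80, 256])
```

For reference, the CER figures quoted in the abstract are character-level edit distances: the number of substituted, deleted, and inserted characters divided by the number of characters in the reference transcript.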
Keywords
Speech recognition; Deep learning; Conformer; Transformer