http://dx.doi.org/10.9728/dcs.2017.18.1.93

LSTM RNN-based Korean Speech Recognition System Using CTC  

Lee, Donghyun (Department of Computer Science and Engineering, Sogang University)
Lim, Minkyu (Department of Computer Science and Engineering, Sogang University)
Park, Hosung (Department of Computer Science and Engineering, Sogang University)
Kim, Ji-Hwan (Department of Computer Science and Engineering, Sogang University)
Publication Information
Journal of Digital Contents Society / v.18, no.1, 2017, pp. 93-99
Abstract
Hybrid approaches using Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) have shown great improvement in speech recognition accuracy. Training an acoustic model with a hybrid approach requires a forced alignment of the HMM state sequence, obtained from a Gaussian Mixture Model (GMM)-Hidden Markov Model (HMM) system. However, training the GMM-HMM takes considerable computation time. This paper proposes an end-to-end approach to LSTM RNN-based Korean speech recognition to improve learning speed. The approach is implemented with the Connectionist Temporal Classification (CTC) algorithm. The proposed method achieved an almost equal recognition rate, while learning was 1.27 times faster.
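As a rough illustration of why CTC needs no forced alignment, the sketch below sums the probability of every frame-level path that collapses to a given label sequence, using the standard CTC forward recursion. It is not the authors' implementation: the blank index `0`, the function names, and the toy probabilities are assumptions made for the example.

```python
import math

BLANK = 0  # assumed index of the CTC blank symbol


def collapse(path):
    """Map a frame-level path to labels: merge repeats, then drop blanks."""
    out = []
    prev = None
    for symbol in path:
        if symbol != prev and symbol != BLANK:
            out.append(symbol)
        prev = symbol
    return out


def ctc_probability(probs, labels):
    """Total probability of all paths that collapse to `labels`.

    probs[t][k] is the network's softmax output for symbol k at frame t.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [BLANK]
    for label in labels:
        ext += [label, BLANK]
    size = len(ext)

    # alpha[s]: probability of all partial paths that end at ext[s]
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping the blank is allowed only between two different labels
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Valid paths end on the last label or on the trailing blank
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


def ctc_loss(probs, labels):
    """Negative log-likelihood that training would minimize."""
    return -math.log(ctc_probability(probs, labels))
```

Because the recursion marginalizes over all alignments, the network can be trained directly from label sequences, which is what lets an end-to-end system skip the GMM-HMM alignment step described above.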
Keywords
Connectionist temporal classification; Long short-term memory; Recurrent neural network; Acoustic model; Speech recognition