http://dx.doi.org/10.3745/KTSDE.2019.8.3.115

CRNN-Based Korean Phoneme Recognition Model with CTC Algorithm  

Hong, Yoonseok (Cognitive Computing Lab, Department of Transdisciplinary Studies, Seoul National University)
Ki, Kyungseo (Cognitive Computing Lab, Department of Transdisciplinary Studies, Seoul National University)
Gweon, Gahgene (Department of Transdisciplinary Studies, Seoul National University)
Publication Information
KIPS Transactions on Software and Data Engineering, Vol.8, No.3, pp.115-122, 2019
Abstract
For Korean phoneme recognition, Hidden Markov Model-Gaussian Mixture Model (HMM-GMM) systems, or hybrid models that combine an artificial neural network with an HMM, have mainly been used. However, these approaches have a limitation: they require force-aligned training corpora that are manually annotated by experts. Recently, researchers have used neural phoneme recognition models that combine a recurrent neural network (RNN)-based structure with the connectionist temporal classification (CTC) algorithm to avoid the need for manually annotated training data. Yet, in terms of implementation, these RNN-based models have another difficulty: the amount of training data required grows as the model structure becomes more sophisticated. This data requirement is particularly problematic for Korean, which lacks refined corpora. In this study, we apply the CTC algorithm, which does not require forced alignment, to create a Korean phoneme recognition model. Specifically, the model is based on a convolutional neural network (CNN), which requires relatively little data and can be trained faster than RNN-based models. We present results from two experiments and the resulting best-performing phoneme recognition model, which distinguishes 49 Korean phonemes. The best-performing model combines a CNN with a 3-hop bidirectional LSTM and achieves a final Phoneme Error Rate (PER) of 3.26. This PER is a considerable improvement over existing Korean phoneme recognition models, which report PERs ranging from 10 to 12.
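The CTC algorithm mentioned in the abstract removes the need for frame-level alignment by letting the network emit a label (or a special blank) per frame, then collapsing repeated labels and deleting blanks to recover the phoneme sequence. A minimal greedy-decoding sketch of that collapse rule, in pure Python with illustrative (hypothetical) phone symbols:

```python
BLANK = "_"  # CTC blank symbol; the symbol choice here is illustrative

def ctc_greedy_decode(frame_labels):
    """Collapse a per-frame best-label path into a CTC output sequence:
    (1) merge consecutive repeated labels, (2) remove blanks."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != BLANK:  # new run of a non-blank label
            out.append(lab)
        prev = lab
    return out

# Frame-wise argmax path over 11 frames collapses to 3 phones;
# a blank between two identical labels would keep them distinct.
path = ["_", "k", "k", "_", "a", "a", "a", "_", "_", "n", "n"]
print(ctc_greedy_decode(path))  # → ['k', 'a', 'n']
```

During training, CTC sums over all frame paths that collapse to the target sequence, which is why no expert-produced forced alignment is needed.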
Keywords
Phoneme Recognition; CTC Algorithm; Convolutional Neural Network; Recurrent Neural Network;
References
1 Zhang, Ying, et al., "Towards end-to-end speech recognition with deep convolutional neural networks," arXiv preprint arXiv:1701.02720 (2017).
2 Graves, Alex, et al., "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," Proceedings of the 23rd international conference on Machine learning. ACM, 2006.
3 Hori, Takaaki, et al., "Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM," arXiv preprint arXiv:1706.02737 (2017).
4 National Institute of the Korean Language (NIKL), Seoul Reading Speech Corpus("서울말 낭독체 발화 말뭉치"), 2003. URL: https://ithub.korean.go.kr
5 Yejin Cho, Korean Grapheme-to-Phoneme Analyzer (KoG2P), 2017. GitHub repository : https://github.com/scarletcho/KoG2P
6 Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556 (2014).
7 Amodei, Dario, et al., "Deep speech 2: End-to-end speech recognition in English and Mandarin," International Conference on Machine Learning. 2016.
8 Xiong, Wayne, et al., "The Microsoft 2016 conversational speech recognition system," Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017.
9 Sainath, Tara N., et al., "Convolutional, long short-term memory, fully connected deep neural networks," Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015.
10 Chung, Junyoung, et al., "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555 (2014).
11 Schwarz, Petr, Pavel Matejka, and Jan Cernocky. "Towards lower error rates in phoneme recognition," International Conference on Text, Speech and Dialogue. Springer, Berlin, Heidelberg, 2004.
12 Gales, Mark JF. "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech & Language, Vol.12, No.2, pp.75-98, 1998.
13 Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, Vol.86, No.11, pp.2278-2324, 1998.
14 Glass, James R. "A probabilistic framework for segment-based speech recognition," Computer Speech & Language, Vol.17, No.2-3, pp.137-152, 2003.
15 Waibel, Alexander, et al., "Phoneme recognition using time-delay neural networks," Readings in Speech Recognition, pp.393-404, 1990.
16 Ji-Young Shin. "Phoneme and Syllable Frequencies of Korean Based on the Analysis of Spontaneous Speech Data," Communication Sciences and Disorders, Vol.13, No.2, pp.193-215, 2008.
17 Bengio, Yoshua. "A connectionist approach to speech recognition," Advances in Pattern Recognition Systems Using Neural Network Technologies, pp.3-23, 1993.
18 Mohamed, Abdel-rahman, George E. Dahl, and Geoffrey Hinton. "Acoustic modeling using deep belief networks," IEEE Transactions on Audio, Speech, and Language Processing, Vol.20, No.1, pp.14-22, 2012.
19 Ardussi Mines, M., Hanson, B. F., and Shoup, J. E. "Frequency of Occurrence of Phonemes in Conversational English," Language and Speech, Vol.21, No.3, pp.221-241, 1978.
20 Seltzer, Michael L., and Jasha Droppo. "Multi-task learning in deep neural networks for improved phoneme recognition," Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013.
21 Graves, Alex, Navdeep Jaitly, and Abdel-rahman Mohamed. "Hybrid speech recognition with deep bidirectional LSTM," Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 2013.
22 Graves, Alex, Abdel-rahman Mohamed, and Geoffrey Hinton. "Speech recognition with deep recurrent neural networks," Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013.
23 Minsoo Na and Minhwa Chung, "Assistive Program for Automatic Speech Transcription based on G2P Conversion and Speech Recognition," Proc. Conference on Korean Society of Speech Sciences, pp.131-132, 2016.
24 Palaz, Dimitri, Ronan Collobert, and Mathew Magimai Doss. "End-to-end phoneme sequence recognition using convolutional neural networks," arXiv preprint arXiv:1312.2137 (2013).
25 Heck, Michael, et al., "Ensembles of Multi-scale VGG Acoustic Models," Proc. Interspeech 2017, pp.1616-1620, 2017.
26 Palaz, Dimitri, Mathew Magimai Doss, and Ronan Collobert. "Convolutional neural networks-based continuous speech recognition using raw speech signal," Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015.