http://dx.doi.org/10.3745/KTSDE.2019.8.3.115

CRNN-Based Korean Phoneme Recognition Model with CTC Algorithm  

Hong, Yoonseok (Cognitive Computing Lab, Department of Transdisciplinary Studies, Seoul National University)
Ki, Kyungseo (Cognitive Computing Lab, Department of Transdisciplinary Studies, Seoul National University)
Gweon, Gahgene (Department of Transdisciplinary Studies, Seoul National University)
Publication Information
KIPS Transactions on Software and Data Engineering, Vol.8, No.3, pp.115-122, 2019
Abstract
For Korean phoneme recognition, Hidden Markov Model-Gaussian Mixture Model (HMM-GMM) systems, or hybrid models that combine an artificial neural network with an HMM, have mainly been used. However, these approaches have a limitation: they require force-aligned training corpora that are manually annotated by experts. Recently, researchers have used neural phoneme recognition models that combine a recurrent neural network (RNN)-based structure with the connectionist temporal classification (CTC) algorithm to avoid the need for manually annotated training data. Yet, in terms of implementation, these RNN-based models have another difficulty: the amount of training data required grows as the model structure becomes more sophisticated. This data requirement is particularly problematic for Korean, which lacks refined corpora. In this study, we apply the CTC algorithm, which does not require forced alignment, to create a Korean phoneme recognition model. Specifically, the model is based on a convolutional neural network (CNN), which requires relatively little data and can be trained faster than RNN-based models. We present results from two experiments and the resulting best-performing phoneme recognition model, which distinguishes 49 Korean phonemes. The best-performing model combines a CNN with a 3-hop bidirectional LSTM and achieves a final Phoneme Error Rate (PER) of 3.26. This PER is a considerable improvement over existing Korean phoneme recognition models, which report PERs ranging from 10 to 12.
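The CTC algorithm mentioned in the abstract removes the need for frame-level alignment by letting the network emit a label (or a special blank) per frame, then collapsing repeated labels and deleting blanks to recover the phoneme sequence. A minimal greedy-decoding sketch of that collapse rule, in pure Python with illustrative (hypothetical) phone symbols:

```python
BLANK = "_"  # CTC blank symbol; the symbol choice here is illustrative

def ctc_greedy_decode(frame_labels):
    """Collapse a per-frame best-label path into a CTC output sequence:
    (1) merge consecutive repeated labels, (2) remove blanks."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != BLANK:  # new run of a non-blank label
            out.append(lab)
        prev = lab
    return out

# Frame-wise argmax path over 11 frames collapses to 3 phones;
# a blank between two identical labels would keep them distinct.
path = ["_", "k", "k", "_", "a", "a", "a", "_", "_", "n", "n"]
print(ctc_greedy_decode(path))  # → ['k', 'a', 'n']
```

During training, CTC sums over all frame paths that collapse to the target sequence, which is why no expert-produced forced alignment is needed.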
Keywords
Phoneme Recognition; CTC Algorithm; Convolutional Neural Network; Recurrent Neural Network;
References
1 Zhang, Ying, et al., "Towards end-to-end speech recognition with deep convolutional neural networks," arXiv preprint arXiv:1701.02720 (2017).
2 Graves, Alex, et al., "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," Proceedings of the 23rd international conference on Machine learning. ACM, 2006.
3 Hori, Takaaki, et al., "Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM," arXiv preprint arXiv:1706.02737 (2017).
4 National Institute of the Korean Language (NIKL), Seoul Reading Speech Corpus("서울말 낭독체 발화 말뭉치"), 2003. URL: https://ithub.korean.go.kr
5 Yejin Cho, Korean Grapheme-to-Phoneme Analyzer (KoG2P), 2017. GitHub repository : https://github.com/scarletcho/KoG2P
6 Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556 (2014).
7 Amodei, Dario, et al., "Deep speech 2: End-to-end speech recognition in English and Mandarin," International Conference on Machine Learning. 2016.
8 Xiong, Wayne, et al., "The Microsoft 2016 conversational speech recognition system," Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017.
9 Sainath, Tara N., et al., "Convolutional, long short-term memory, fully connected deep neural networks," Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015.
10 Chung, Junyoung, et al., "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555 (2014).
11 Schwarz, Petr, Pavel Matejka, and Jan Cernocky. "Towards lower error rates in phoneme recognition," International Conference on Text, Speech and Dialogue. Springer, Berlin, Heidelberg, 2004.
12 Gales, Mark JF. "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech & Language, Vol.12, No.2, pp.75-98, 1998.
13 Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, Vol.86, No.11, pp.2278-2324, 1998.
14 Glass, James R. "A probabilistic framework for segment-based speech recognition," Computer Speech & Language, Vol.17, No.2-3, pp.137-152, 2003.
15 Waibel, Alexander, et al., "Phoneme recognition using time-delay neural networks," Readings in Speech Recognition, pp.393-404, 1990.
16 Ji-Young Shin. "Phoneme and Syllable Frequencies of Korean Based on the Analysis of Spontaneous Speech Data," Communication Sciences and Disorders, Vol.13, No.2, pp.193-215, 2008.
17 Bengio, Yoshua. "A connectionist approach to speech recognition," Advances in Pattern Recognition Systems Using Neural Network Technologies, pp.3-23, 1993.
18 Mohamed, Abdel-rahman, George E. Dahl, and Geoffrey Hinton. "Acoustic modeling using deep belief networks," IEEE Transactions on Audio, Speech, and Language Processing, Vol.20, No.1, pp.14-22, 2012.
19 Ardussi Mines, M., Hanson, B. F., and Shoup, J. E. "Frequency of Occurrence of Phonemes in Conversational English," Language and Speech, Vol.21, No.3, pp.221-241, 1978.
20 Seltzer, Michael L., and Jasha Droppo. "Multi-task learning in deep neural networks for improved phoneme recognition," Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013.
21 Graves, Alex, Navdeep Jaitly, and Abdel-rahman Mohamed. "Hybrid speech recognition with deep bidirectional LSTM," Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 2013.
22 Graves, Alex, Abdel-rahman Mohamed, and Geoffrey Hinton. "Speech recognition with deep recurrent neural networks," Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013.
23 Minsoo Na and Minhwa Chung, "Assistive Program for Automatic Speech Transcription based on G2P Conversion and Speech Recognition," Proc. Conference on Korean Society of Speech Sciences, pp.131-132, 2016.
24 Palaz, Dimitri, Ronan Collobert, and Mathew Magimai Doss. "End-to-end phoneme sequence recognition using convolutional neural networks," arXiv preprint arXiv:1312.2137 (2013).
25 Heck, Michael, et al., "Ensembles of Multi-scale VGG Acoustic Models," Proc. Interspeech 2017, pp.1616-1620, 2017.
26 Palaz, Dimitri, Mathew Magimai Doss, and Ronan Collobert. "Convolutional neural networks-based continuous speech recognition using raw speech signal," Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015.