Fig. 1. Phoneme Classifier Overview
Fig. 2. VGG Network-based DeepCNN, DeepCNN+BiLSTM, and DeepCNN+BiGRU Models
Fig. 3. DeepCNN+BiGRU (1hop–5hop) and DeepCNN+BiLSTM (1hop–5hop) Models Built on the Improved DeepCNN Models
Fig. 4. Temporal Labels for Phoneme Recognition Models. One Phoneme Label per 30 ms Is Displayed in Each Cell
Table 1. Results for Phoneme Recognition
Table 2. Results for Phoneme Recognition