http://dx.doi.org/10.5909/JBE.2020.25.5.742

Performance Enhancement of Phoneme and Emotion Recognition by Multi-task Training of Common Neural Network  

Kim, Jaewon (Dept. of Electronics Engineering, Kwangwoon University)
Park, Hochong (Dept. of Electronics Engineering, Kwangwoon University)
Publication Information
Journal of Broadcast Engineering, vol. 25, no. 5, 2020, pp. 742-749
Abstract
This paper proposes a method for recognizing both phonemes and emotion using a common neural network, together with a multi-task training method for that network. The common neural network performs the same function for both recognition tasks, which corresponds to the way humans recognize multiple kinds of information with a single auditory system. The multi-task training models features that are commonly applicable to multiple kinds of information and thus provides generalized training, which improves performance by reducing the overfitting that occurs in conventional individual training for each task. A method for further increasing phoneme recognition performance is also proposed, in which the phoneme task is weighted during multi-task training. When the same feature vector and network architecture are used, the proposed common neural network with multi-task training is confirmed to provide higher performance than individual networks trained separately for each task.
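To make the training scheme in the abstract concrete, the following is a minimal sketch of a shared ("common") network with two task-specific output heads and a weighted multi-task loss, written in PyTorch. All layer sizes, class counts, and the phoneme weight value here are illustrative assumptions, not details taken from the paper.

    import torch
    import torch.nn as nn

    class CommonRecognitionNet(nn.Module):
        """Shared trunk feeding two task heads (hypothetical dimensions)."""
        def __init__(self, feat_dim=39, hidden_dim=256,
                     num_phonemes=39, num_emotions=4):
            super().__init__()
            # Common network: one hidden representation serves both tasks,
            # mirroring a single auditory front end for multiple information.
            self.trunk = nn.Sequential(
                nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            )
            self.phoneme_head = nn.Linear(hidden_dim, num_phonemes)
            self.emotion_head = nn.Linear(hidden_dim, num_emotions)

        def forward(self, x):
            h = self.trunk(x)
            return self.phoneme_head(h), self.emotion_head(h)

    def multitask_loss(phn_logits, emo_logits, phn_target, emo_target,
                       w_phn=2.0):
        # Weighted sum of per-task cross-entropy losses; w_phn > 1 emphasizes
        # the phoneme task, analogous to the phoneme weighting the paper
        # proposes (the actual weight value is an assumption).
        ce = nn.CrossEntropyLoss()
        return w_phn * ce(phn_logits, phn_target) + ce(emo_logits, emo_target)

Training both heads against this single loss updates the shared trunk with gradients from both tasks, which is what regularizes the feature modeling relative to training a separate network per task.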
Keywords
deep neural network; common recognition; multi-task training; emotion recognition; phoneme recognition;