http://dx.doi.org/10.5909/JBE.2020.25.5.742

Performance Enhancement of Phoneme and Emotion Recognition by Multi-task Training of Common Neural Network  

Kim, Jaewon (Dept. of Electronics Engineering, Kwangwoon University)
Park, Hochong (Dept. of Electronics Engineering, Kwangwoon University)
Publication Information
Journal of Broadcast Engineering, vol. 25, no. 5, 2020, pp. 742-749
Abstract
This paper proposes a method for recognizing both phonemes and emotion using a common neural network, together with a multi-task training method for that network. The common neural network performs the same function for both recognition tasks, which corresponds to the way humans recognize multiple kinds of information with a single auditory system. The multi-task training models features that are commonly applicable to multiple kinds of information and thus provides generalized training, which improves performance by reducing the overfitting that occurs in conventional individual training for each task. A method for further increasing phoneme recognition performance is also proposed, in which the phoneme task is weighted during multi-task training. When the same feature vector and network architecture are used, the proposed common neural network with multi-task training is confirmed to provide higher performance than individual networks trained separately for each task.
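To make the training scheme in the abstract concrete, the following is a minimal sketch of a shared ("common") network with two task-specific output heads and a weighted multi-task loss, written in PyTorch. All layer sizes, class counts, and the phoneme weight value here are illustrative assumptions, not details taken from the paper.

    import torch
    import torch.nn as nn

    class CommonRecognitionNet(nn.Module):
        """Shared trunk feeding two task heads (hypothetical dimensions)."""
        def __init__(self, feat_dim=39, hidden_dim=256,
                     num_phonemes=39, num_emotions=4):
            super().__init__()
            # Common network: one hidden representation serves both tasks,
            # mirroring a single auditory front end for multiple information.
            self.trunk = nn.Sequential(
                nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            )
            self.phoneme_head = nn.Linear(hidden_dim, num_phonemes)
            self.emotion_head = nn.Linear(hidden_dim, num_emotions)

        def forward(self, x):
            h = self.trunk(x)
            return self.phoneme_head(h), self.emotion_head(h)

    def multitask_loss(phn_logits, emo_logits, phn_target, emo_target,
                       w_phn=2.0):
        # Weighted sum of per-task cross-entropy losses; w_phn > 1 emphasizes
        # the phoneme task, analogous to the phoneme weighting the paper
        # proposes (the actual weight value is an assumption).
        ce = nn.CrossEntropyLoss()
        return w_phn * ce(phn_logits, phn_target) + ce(emo_logits, emo_target)

Training both heads against this single loss updates the shared trunk with gradients from both tasks, which is what regularizes the feature modeling relative to training a separate network per task.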
Keywords
deep neural network; common recognition; multi-task training; emotion recognition; phoneme recognition;