http://dx.doi.org/10.6109/jicce.2021.19.3.148

Speech Emotion Recognition Using 2D-CNN with Mel-Frequency Cepstrum Coefficients  

Eom, Youngsik (Department of Electronic and Electrical Engineering, Sungkyunkwan University)
Bang, Junseong (Public Safety Intelligence Research Section, Electronics and Telecommunications Research Institute (ETRI))
Abstract
With the advent of context-aware computing, many attempts have been made to understand emotions. Among these, Speech Emotion Recognition (SER) recognizes a speaker's emotions from speech information. SER succeeds when distinctive 'features' are selected and 'classified' in an appropriate way. In this paper, the performance of SER using neural network models (e.g., a fully connected network (FCN) and a convolutional neural network (CNN)) with Mel-Frequency Cepstral Coefficients (MFCC) is examined in terms of the accuracy and the distribution of emotion recognition. On the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset, after tuning the model parameters, a two-dimensional Convolutional Neural Network (2D-CNN) model with MFCC showed the best performance, with an average accuracy of 88.54% over five emotions (anger, happiness, calm, fear, and sadness) of men and women. In addition, the distribution of emotion recognition accuracies for the neural network models indicates that the 2D-CNN with MFCC can be expected to achieve an overall accuracy of 75% or more.
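To make the pipeline concrete, the sketch below shows, under stated assumptions, how MFCC features can be extracted with librosa [4] and treated as a single-channel "image" for a small 2D-CNN classifier. The Keras model, the MFCC dimensions (40 coefficients by 174 frames), the zero-padding scheme, and the file path are illustrative assumptions; they do not reproduce the exact architecture or hyperparameters tuned in the paper.

# Minimal sketch of an MFCC + 2D-CNN SER pipeline (illustrative only).
import numpy as np
import librosa
from tensorflow.keras import layers, models

N_MFCC = 40      # assumed number of MFCC coefficients
N_FRAMES = 174   # assumed fixed number of time frames after padding/truncation
N_CLASSES = 5    # anger, happiness, calm, fear, sadness

def extract_mfcc(path: str) -> np.ndarray:
    """Load a speech clip and return a fixed-size MFCC matrix with a channel axis."""
    y, sr = librosa.load(path, sr=22050)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC)
    # Pad or truncate along the time axis so every sample has the same shape.
    if mfcc.shape[1] < N_FRAMES:
        mfcc = np.pad(mfcc, ((0, 0), (0, N_FRAMES - mfcc.shape[1])))
    else:
        mfcc = mfcc[:, :N_FRAMES]
    return mfcc[..., np.newaxis]  # shape: (N_MFCC, N_FRAMES, 1)

def build_2d_cnn() -> models.Model:
    """Small 2D-CNN that classifies the MFCC matrix into N_CLASSES emotions."""
    model = models.Sequential([
        layers.Input(shape=(N_MFCC, N_FRAMES, 1)),
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(N_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

if __name__ == "__main__":
    # Hypothetical RAVDESS file path, for illustration only.
    x = extract_mfcc("ravdess/Actor_01/03-01-05-01-01-01-01.wav")
    model = build_2d_cnn()
    print(model.predict(x[np.newaxis, ...]).shape)  # (1, N_CLASSES)

Treating the MFCC matrix as a two-dimensional input lets standard 2-D convolutions learn joint time-frequency patterns, which is the motivation for applying a 2D-CNN to MFCC features rather than a purely frame-wise model.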
Keywords
Convolutional neural network; Deep learning; Mel-frequency cepstrum coefficients; Speech emotion recognition
References
1 S. Byun and S. Lee, "Emotion recognition using tone and tempo based on voice for IoT," Trans. of the Korean Institute of Electrical Engineers, vol. 65, no. 1, pp. 116-121, 2016. DOI: 10.5370/kiee.2016.65.1.116.
2 I. Hong, Y. Ko, Y. Kim, and H. Shin, "A study on the emotional feature composed of the mel-frequency cepstral coefficient and the speech speed," Journal of Computing Science and Engineering, vol. 13, no. 4, pp. 131-140, 2019. DOI: 10.5626/JCSE.2019.13.4.131.
3 M. S. Likitha, S. R. R. Gupta, K. Hasitha, and A. U. Raju, "Speech based human emotion recognition using MFCC," in 2017 Int. Conf. on Wireless Communications, Signal Processing and Networking (WiSPNET), pp. 2257-2260, Mar. 2017. DOI: 10.1109/WiSPNET.2017.8300161.
4 librosa [Internet]. Available: https://librosa.org/doc/latest/index.html.
5 J. Lee, H. Ryu, D. Chang, and M. Koo, "End-to-end Korean speech emotion recognition using deep neural networks," in Korea Computer Congress, pp. 1000-1002, Jun. 2018.
6 K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv: 1409.1556, 2014.
7 S. Park, D. Kim, S. Kwon, and N. Park, "Speech emotion recognition based on CNN using spectrogram," in Information and Control Symposium, pp. 240-241, Oct. 2018.
8 G. Tangriberganov, T. A. Adesuyi, and B. Kim, "A hybrid approach for speech emotion recognition using 1D-CNN LSTM," in Korea Computer Congress, pp. 833-835, Jul. 2020.
9 G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, B. Schuller, and S. Zafeiriou, "Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network," in 2016 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp. 5200-5204, Mar. 2016. DOI: 10.1109/ICASSP.2016.7472669.
10 P. Mishra and R. Sharma, "Gender differentiated convolutional neural networks for speech emotion recognition," in 12th Int. Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT), pp. 142-148, Oct. 2020. DOI: 10.1109/ICUMT51630.2020.9222412.
11 W. Tang, G. Long, L. Liu, T. Zhou, J. Jiang, and M. Blumenstein, "Rethinking 1D-CNN for time series classification: a stronger baseline," arXiv: 2002.10061, 2020.
12 J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, "ImageNet: a large-scale hierarchical image database," in 2009 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 248-255, Jun. 2009. DOI: 10.1109/CVPR.2009.5206848.
13 J. Lee, U. Yoon, and G. Jo, "CNN-based speech emotion recognition model applying transfer learning and attention mechanism," Journal of KIISE, vol. 47, no. 7, pp. 665-673, 2020. DOI: 10.5626/JOK.2020.47.7.665.
14 S. R. Livingstone and F. A. Russo, "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English," PLoS ONE, vol. 13, no. 5, e0196391, May 2018. DOI: 10.1371/journal.pone.0196391.
15 L. Huang, J. Dong, D. Zhou, and Q. Zhang, "Speech emotion recognition based on three-channel feature fusion of CNN and BiLSTM," in 2020 the 4th International Conference on Innovation in Artificial Intelligence (ICIAI), pp. 52-58, May 2020. DOI: 10.1145/3390557.3394317.