Acknowledgement
This work was supported by the National Natural Science Foundation of China (No.61901227) and the Natural Science Foundation for Colleges and Universities in Jiangsu Province, China (No.19KJB510049).
References
- F. Eyben et al., The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing, IEEE Trans. Affect. Comput. 7 (2015), 190-202. https://doi.org/10.1109/TAFFC.2015.2457417
- A. Origlia, V. Galata, and B. Ludusan, Automatic classification of emotions via global and local prosodic features on a multilingual emotional database, in Proc. Int. Conf. Speech Prosody (Chicago, IL, USA), May 2010, pp. 1-4.
- W. H. Li and L. Jiang, Analysis of common feature recognition performance of Chinese speech emotion, Intell. Comput. Appl. 7 (2017), 56-58.
- W. H. Cao, J. P. Xu, and Z. T. Liu, Speaker-independent speech emotion recognition based on random forest feature selection algorithm, in Proc. Chin. Control. Conf. (CCC), (Dalian, China), July 2017, pp. 10995-10998.
- M. Lugger and B. Yang, The relevance of voice quality features in speaker independent emotion recognition, in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (Honolulu, HI, USA), Apr. 2007. https://doi.org/10.1109/ICASSP.2007.367152
- P. Shi, Speech emotion recognition based on deep belief network, in Proc. Int. Conf. Netw., Sens. Control (Zhuhai, China), Mar. 2018, pp. 1-5.
- K. H. Lee, H. K. Choi, and B. T. Jang, A study on speech emotion recognition using a deep neural network, in Proc. Int. Conf. Inf. Commun. Technol. Converg. (Jeju, Rep. of Korea), Oct. 2019, pp. 1162-1165.
- Z. Yao et al., Speech emotion recognition using fusion of three multi-task learning-based classifiers: HSF-DNN, MS-CNN and LLD-RNN, Speech Commun. 120 (2020), 11-19. https://doi.org/10.1016/j.specom.2020.03.005
- G. Liu, W. He, and B. Jin, Feature fusion of speech emotion recognition based on deep learning, in Proc. Int. Conf. Netw. Infrastruct. Digit. Content (Guiyang, China), Aug. 2018, pp. 193-197.
- L. Chao et al., Improving generation performance of speech emotion recognition by denoising autoencoders, in Proc. Int. Symp. Chin. Spoken Lang. Process. (Singapore), Sept. 2014, pp. 341-344.
- L. Li et al., Deep factorization for speech signal, in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (Calgary, Canada), Apr. 2018, pp. 5094-5098.
- Q. Mao et al., Learning salient features for speech emotion recognition using convolutional neural networks, IEEE Trans. Multimed. 16 (2014), 2203-2213. https://doi.org/10.1109/TMM.2014.2360798
- K. Han, D. Yu, and I. Tashev, Speech emotion recognition using deep neural network and extreme learning machine, in Proc. Annu. Conf. Int. Speech Commun. Assoc., Sept. 2014, pp. 223-227.
- P. Guo, X. Wang, and Y. Han, The enhanced genetic algorithms for the optimization design, in Proc. Int. Conf. Biomed. Eng. Inf. (Yantai, China), Oct. 2010, pp. 2990-2994.
- J. Wang, Z. Han, and S. Lun, Speech emotion recognition system based on genetic algorithm and neural network, in Proc. Int. Conf. Image Anal. Signal Process. (Wuhan, China), Oct. 2011, pp. 578-582.
- Y. Wang and H. Huo, Speech recognition based on genetic algorithm optimized support vector machine, in Proc. Int. Conf. Syst. Informatics (Shanghai, China), Nov. 2019, pp. 439-444.
- L. Qin, Q. Li, and X. Guan, Pitch extraction for musical signals with modified AMDF, in Proc. Int. Conf. Multimed. Technol. (Hangzhou, China), July 2011, pp. 3599-3602.
- M. Jalil, F. A. Butt, and A. Malik, Short-time energy, magnitude, zero crossing rate and autocorrelation measurement for discriminating voiced and unvoiced segments of speech signals, in Proc. Int. Conf. Technol. Adv. Electr., Electron. Comput. Eng. (Konya, Turkey), May 2013, pp. 208-212.
- F. Richardson, D. Reynolds, and N. Dehak, Deep neural network approaches to speaker and language recognition, IEEE Signal Process. Lett. 22 (2015), 1671-1675. https://doi.org/10.1109/LSP.2015.2420092
- Y. Tian et al., Investigation of bottleneck features and multilingual deep neural networks for speaker verification, in Proc. Annu. Conf. Int. Speech Commun. Assoc. (Dresden, Germany), Sept. 2015, pp. 1151-1155.
- X. Zhou, J. Guo, and R. Bie, Deep learning based affective model for speech emotion recognition, in Proc. Int. IEEE Conf. Ubiquitous Intell. Comput. Adv. Trusted Comput. Scalable Comput. Commun. Cloud Big Data Comput. & Internet People Smart World Congr. (Toulouse, France), July 2016, pp. 841-846.
- P. Matejka et al., Neural network bottleneck features for language identification, in Proc. Odyssey 2014: Speak. Lang. Recognit. Workshop (Joensuu, Finland), June 2014, pp. 299-304.
- M. McLaren, L. Ferrer, and A. Lawson, Exploring the role of phonetic bottleneck features for speaker and language recognition, in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (Shanghai, China), Mar. 2016, pp. 5575-5579.
- Y. Lei et al., Application of convolutional neural networks to language identification in noisy conditions, in Proc. Speak. Lang. Recognit. Workshop (Joensuu, Finland), June 2014, pp. 287-292.
- H. S. Das and P. Roy, Bottleneck feature-based hybrid deep autoencoder approach for Indian language identification, Arab. J. Sci. Eng. 45 (2020), 3425-3436. https://doi.org/10.1007/s13369-020-04430-9
- A. Fischer and C. Igel, Bounding the bias of contrastive divergence learning, Neural Comput. 23 (2011), 664-673. https://doi.org/10.1162/NECO_a_00085
- L. Chen et al., Speech emotion recognition: Features and classification models, Digit. Signal Process. 22 (2012), 1154-1160. https://doi.org/10.1016/j.dsp.2012.05.007
- L. Sun, S. Fu, and F. Wang, Decision tree SVM model with Fisher feature selection for speech emotion recognition, EURASIP J. Audio Speech Music Process. 2019 (2019), 1-14. https://doi.org/10.1186/s13636-018-0144-6
- A. D. Dileep and C. C. Sekhar, GMM-based intermediate matching kernel for classification of varying length patterns of long duration speech using support vector machines, IEEE Trans. Neural Netw. Learn. Syst. 25 (2014), 1421-1432. https://doi.org/10.1109/TNNLS.2013.2293512
- S. Gupta and A. Mehra, Speech emotion recognition using SVM with thresholding fusion, in Proc. Int. Conf. Signal Process. Integr. Netw. (Noida, India), Feb. 2015, pp. 570-574.
- P. Shen, Z. Changjun, and X. Chen, Automatic speech emotion recognition using support vector machine, in Proc. Int. Conf. Electron. Mech. Eng. Inf. Technol. (Harbin, China), Aug. 2011, pp. 621-625.
- C. Torres-Valencia, M. Alvarez-Lopez, and A. Orozco-Gutierrez, SVM-based feature selection methods for emotion recognition from multimodal data, J. Multimodal User Interfaces 11 (2017), 9-23. https://doi.org/10.1007/s12193-016-0222-y
- L. M. Saini, S. K. Aggarwal, and A. Kumar, Parameter optimisation using genetic algorithm for support vector machine-based price-forecasting model in National electricity market, IET Gener. Transm. Distrib. 4 (2010), 36-49. https://doi.org/10.1049/iet-gtd.2008.0584
- L. Chen et al., Two-layer fuzzy multiple random forest for speech emotion recognition in human-robot interaction, Inf. Sci. 509 (2020), 150-163. https://doi.org/10.1016/j.ins.2019.09.005