DOI: http://dx.doi.org/10.3745/JIPS.02.0161

Audio and Video Bimodal Emotion Recognition in Social Networks Based on Improved AlexNet Network and Attention Mechanism  

Liu, Min (Software School, Hunan Vocational College of Science and Technology)
Tang, Jun (Software School, Hunan Vocational College of Science and Technology)
Publication Information
Journal of Information Processing Systems / v.17, no.4, 2021, pp. 754-771
Abstract
In continuous dimensional emotion recognition, the parts of a signal that carry emotional expression differ across modalities, and each modality influences the estimated emotional state to a different degree. This paper therefore studies the fusion of the two most informative modalities for emotion recognition, speech and facial expression, and proposes a bimodal emotion recognition method that combines an improved AlexNet network with an attention mechanism. After simple preprocessing of the audio and video signals, audio features are first extracted using prior knowledge. Facial expression features are then extracted by the improved AlexNet network. Finally, a multimodal attention mechanism fuses the facial expression and audio features, and an improved loss function mitigates the missing-modality problem, improving both the robustness of the model and its recognition performance. Experimental results show that the proposed model achieves concordance correlation coefficients (CCC) of 0.729 and 0.718 on the arousal and valence dimensions, respectively, outperforming several comparison algorithms.
Keywords
AlexNet Networks; Attention Mechanism; Concordance Correlation Coefficient; Deep Learning; Feature Layer Fusion; Multimodal Emotion Recognition; Social Networks;