http://dx.doi.org/10.5392/IJoC.2022.18.3.011

Multimodal Attention-Based Fusion Model for Context-Aware Emotion Recognition  

Vo, Minh-Cong (Dept. of Artificial Intelligence Convergence, Chonnam National University)
Lee, Guee-Sang (Dept. of Artificial Intelligence Convergence, Chonnam National University)
Abstract
Human emotion recognition is an exciting topic that has attracted researchers for a long time. In recent years, there has been increasing interest in exploiting contextual information for emotion recognition. Explorations in psychology show that emotional perception is affected not only by facial expressions but also by contextual information from the scene, such as human activities, interactions, and body poses. These findings initiated a trend in computer vision of treating contexts as additional modalities, alongside facial expressions, from which to infer the predicted emotion. However, contextual information has not been fully exploited: the scene emotion created by the surrounding environment can also shape how people perceive emotion. Moreover, plain additive fusion in multimodal training is impractical, because the modalities do not contribute equally to the final prediction. This paper contributes to this growing area of research by exploring the effectiveness of the emotional scene gist of the input image for inferring the emotional state of the primary target. The emotional scene gist comprises the emotion, the emotional feelings, and the actions or events in the input image that directly trigger emotional reactions. We also present an attention-based fusion network that combines multimodal features according to their impact on the target's emotional state. We demonstrate the effectiveness of the method through a significant improvement on the EMOTIC dataset.
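To make the attention-based fusion idea concrete, the following PyTorch code is a minimal, illustrative sketch of attention-weighted modality fusion, not the authors' actual architecture: the module name, the 512-dimensional features, the three modalities (face, body, scene context), and the shared scoring head are assumptions made for the example. The 26 output categories follow the discrete emotion labels of the EMOTIC dataset.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusion(nn.Module):
    """Attention-weighted fusion of per-modality features, in place of
    plain additive fusion, so each modality's contribution is learned."""

    def __init__(self, feat_dim: int = 512, num_emotions: int = 26):
        super().__init__()
        # Scoring head shared across modalities: feature vector -> scalar score.
        self.score = nn.Linear(feat_dim, 1)
        self.classifier = nn.Linear(feat_dim, num_emotions)

    def forward(self, feats: torch.Tensor):
        # feats: (batch, num_modalities, feat_dim),
        # e.g. stacked [face, body, scene-context] embeddings.
        scores = self.score(feats).squeeze(-1)              # (batch, M)
        weights = F.softmax(scores, dim=1)                  # per-modality attention
        fused = (weights.unsqueeze(-1) * feats).sum(dim=1)  # (batch, feat_dim)
        return self.classifier(fused), weights

# Toy usage: a batch of 4 samples, each with 3 modality embeddings of size 512.
model = AttentionFusion()
feats = torch.randn(4, 3, 512)
logits, weights = model(feats)
print(logits.shape, weights.shape)  # torch.Size([4, 26]) torch.Size([4, 3])
```

Unlike additive fusion, the softmax weights let the network down-weight an uninformative modality (for instance, an occluded face) on a per-sample basis, which is the behavior the abstract argues for.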
Keywords
Context-aware; Multimodal fusion; Attention-based fusion; Emotion recognition; Deep learning