Multi-Modal Emotion Recognition in Videos Based on Pre-Trained Models

Eun Hee Kim; Ju Hyun Shin

doi:10.30693/SMJ.2024.13.10.19

Smart Media Journal (스마트미디어저널)

Volume 13, Issue 10, Pages 19-27, 2024
pISSN 2287-1322, eISSN 2288-9671

THE KOREAN INSTITUTE OF SMART MEDIA (한국스마트미디어학회)


사전학습 모델 기반 발화 동영상 멀티 모달 감정 인식

Eun Hee Kim (Department of Computer Engineering, Chosun University);
Ju Hyun Shin (Division of New Industry Convergence, Chosun University)

Received : 2024.09.13
Accepted : 2024.10.21
Published : 2024.10.31

https://doi.org/10.30693/SMJ.2024.13.10.19


Abstract

Demand for non-face-to-face counseling has grown rapidly in recent years, increasing the need for emotion recognition technology that combines multiple modalities such as text, voice, and facial expressions. Existing datasets such as FER-2013, CK+, and AFEW are dominated by non-Korean subjects and have imbalanced emotion labels; to address these issues, this paper uses Korean video data. We propose a method that improves multimodal emotion recognition in videos by taking the text modality as the basis and combining it with the strengths of the image modality. Pre-trained models are used to overcome the limitations of a small training set: a GPT-4-based large language model (LLM) is applied to the text, and a pre-trained model based on the VGG-19 architecture is fine-tuned on facial expression images. The per-modality emotion results are then combined into a single representative emotion by joining the emotion information extracted from the text with the changes in facial expression across the video. Two strategies are applied. First, when the text and image emotions disagree, a threshold is applied so that the text-based emotion is selected whenever it is judged trustworthy. Second, the representative emotion is adjusted using the distribution of emotions across frames. Together, these strategies improved the F1-score by 19% compared with the existing method, which uses the average emotion value over frames.
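The fusion idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the label set, the threshold value, or the exact adjustment rule, so the seven labels, the `text_confidence` input, the `text_threshold` default of 0.7, and all function names here are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List

# Assumed label set (hypothetical; the paper's actual labels may differ).
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]


def average_baseline(frame_scores: List[Dict[str, float]]) -> str:
    """Baseline: average each emotion's score over all frames, pick the highest."""
    n = max(len(frame_scores), 1)
    means = {
        e: sum(scores.get(e, 0.0) for scores in frame_scores) / n
        for e in EMOTIONS
    }
    return max(means, key=means.get)


def frame_distribution(frame_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Fraction of frames in which each emotion is the top-scoring label."""
    top = Counter(max(s, key=s.get) for s in frame_scores if s)
    total = sum(top.values())
    return {e: (top[e] / total if total else 0.0) for e in EMOTIONS}


def fuse(
    text_emotion: str,
    text_confidence: float,
    frame_scores: List[Dict[str, float]],
    text_threshold: float = 0.7,  # assumed value, not taken from the paper
) -> str:
    """Pick a representative emotion from text and per-frame image results."""
    dist = frame_distribution(frame_scores)
    image_emotion = max(dist, key=dist.get)
    if text_emotion == image_emotion:
        return text_emotion
    # Mismatch: prefer the text result only when it is deemed trustworthy.
    if text_confidence >= text_threshold:
        return text_emotion
    # Otherwise fall back to the emotion that dominates the frame distribution.
    return image_emotion


frames = [
    {"neutral": 0.5, "happy": 0.4},
    {"neutral": 0.5, "happy": 0.4},
    {"happy": 1.0},
]
baseline = average_baseline(frames)
trusted = fuse("sad", 0.9, frames)
untrusted = fuse("sad", 0.3, frames)
```

In this toy input a single strongly "happy" frame pulls the averaged scores toward "happy", while the frame distribution still reports "neutral" as the label that wins in most frames, which is the kind of difference the distribution-based adjustment is meant to capture.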

Acknowledgement

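This research was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (Ministry of Science and ICT) (No. 2023R1A2C1006149), and by research funds from Chosun University (2024).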
