• Title/Abstract/Keywords: multimodal fusion

Search results: 53 (processing time: 0.026 s)

Audio and Video Bimodal Emotion Recognition in Social Networks Based on Improved AlexNet Network and Attention Mechanism

  • Liu, Min;Tang, Jun
    • Journal of Information Processing Systems / Vol. 17, No. 4 / pp.754-771 / 2021
  • In continuous dimensional emotion recognition, the parts that highlight emotional expression differ across modalities, and the influence of each modality on the emotional state also differs. This paper therefore studies the fusion of the two most important modalities in emotion recognition (voice and facial expression) and proposes a dual-modal emotion recognition method that combines an improved AlexNet network with an attention mechanism. After simple preprocessing of the audio and video signals, audio features are first extracted using prior knowledge. Then, facial expression features are extracted by the improved AlexNet network. Finally, a multimodal attention mechanism fuses the facial expression and audio features, and an improved loss function mitigates the missing-modality problem, improving the robustness of the model and the performance of emotion recognition. Experimental results show that the concordance correlation coefficients of the proposed model in the arousal and valence dimensions were 0.729 and 0.718, respectively, which is superior to several comparative algorithms.
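The attention-based fusion step described in the abstract can be sketched roughly as follows; the feature dimensions, the softmax scoring, and all function and variable names are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_fuse(audio_feat, face_feat, w_a, w_f):
    """Fuse two modality feature vectors with scalar attention weights.

    Each modality is scored by a (here random, normally learned)
    projection; a softmax over the scores gives the attention weights,
    and the fused representation is the weighted sum of the features.
    """
    scores = np.array([w_a @ audio_feat, w_f @ face_feat])
    alpha = softmax(scores)  # attention weights, sum to 1
    fused = alpha[0] * audio_feat + alpha[1] * face_feat
    return fused, alpha

rng = np.random.default_rng(0)
d = 8                              # assumed feature dimension
audio = rng.standard_normal(d)    # stand-in audio feature vector
face = rng.standard_normal(d)     # stand-in facial-expression feature vector
w_a = rng.standard_normal(d)
w_f = rng.standard_normal(d)
fused, alpha = attention_fuse(audio, face, w_a, w_f)
```

In the real model the scoring projections would be learned jointly with the rest of the network; the sketch only shows how attention weights let the fused vector lean toward the more informative modality.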

Multimodal audiovisual speech recognition architecture using a three-feature multi-fusion method for noise-robust systems

  • Sanghun Jeon;Jieun Lee;Dohyeon Yeo;Yong-Ju Lee;SeungJun Kim
    • ETRI Journal / Vol. 46, No. 1 / pp.22-34 / 2024
  • Exposure to varied noisy environments impairs the recognition performance of artificial-intelligence-based speech recognition technologies. Services with degraded performance can be deployed as limited systems that assure good performance only in certain environments, but this impairs the general quality of speech recognition services. This study introduces an audiovisual speech recognition (AVSR) model that is robust to various noise settings by mimicking the elements of human dialogue recognition. The model converts word embeddings and log-Mel spectrograms into feature vectors for audio recognition; a dense spatial-temporal convolutional neural network model extracts features from log-Mel spectrograms transformed for visual-based recognition. This approach exhibits improved aural and visual recognition capabilities. We assess the signal-to-noise ratio in nine synthesized noise environments, where the proposed model exhibits lower average error rates: the error rate of the AVSR model using the three-feature multi-fusion method is 1.711%, compared with the general rate of 3.939%. Owing to its enhanced stability and recognition rate, this model is applicable in noise-affected environments.
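The abstract does not spell out how the three feature streams are combined; a minimal sketch of one plausible fusion step, where the unit-norm-then-concatenate choice and all names are assumptions, is:

```python
import numpy as np

def l2norm(v):
    """Scale a feature vector to unit length (no-op on zero vectors)."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def multi_fuse(word_emb, audio_feat, visual_feat):
    """Normalize each modality's feature vector, then concatenate them
    so that no single modality dominates the fused representation."""
    return np.concatenate(
        [l2norm(word_emb), l2norm(audio_feat), l2norm(visual_feat)]
    )

# Stand-in features: word embedding, audio (log-Mel) feature, visual feature.
fused = multi_fuse(
    np.ones(4), np.arange(3, dtype=float), np.array([0.5, -0.5])
)
```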

Multimodal Biometrics Recognition from Facial Video with Missing Modalities Using Deep Learning

  • Maity, Sayan;Abdel-Mottaleb, Mohamed;Asfour, Shihab S.
    • Journal of Information Processing Systems / Vol. 16, No. 1 / pp.6-29 / 2020
  • Biometric identification using multiple modalities has attracted the attention of many researchers because it produces more robust and trustworthy results than single-modality biometrics. In this paper, we present a novel multimodal recognition system that trains a deep learning network to automatically learn features after extracting multiple biometric modalities from a single data source, i.e., facial video clips. Utilizing the different modalities present in the facial video clips, i.e., left ear, left profile face, frontal face, right profile face, and right ear, we train supervised denoising auto-encoders to automatically extract robust and non-redundant features. The automatically learned features are then used to train modality-specific sparse classifiers to perform multimodal recognition. Moreover, the proposed technique proved robust when some of the above modalities were missing during testing. The proposed system has three main components: detection, which consists of modality-specific detectors that automatically detect images of the different modalities present in facial video clips; feature selection, which uses a supervised denoising sparse auto-encoder network to capture discriminative representations that are robust to illumination and pose variations; and classification, which consists of a set of modality-specific sparse representation classifiers for unimodal recognition, followed by score-level fusion of the recognition results of the available modalities. Experiments conducted on a constrained facial video dataset (WVU) and an unconstrained facial video dataset (HONDA/UCSD) yielded Rank-1 recognition rates of 99.17% and 97.14%, respectively. The multimodal recognition accuracy demonstrates the superiority and robustness of the proposed approach irrespective of the illumination, non-planar movement, and pose variations present in the video clips, even when modalities are missing.
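Score-level fusion that tolerates missing modalities, as described above, might look roughly like the following sketch; the modality names come from the abstract, but the equal weights and the renormalization rule are assumptions:

```python
MODALITIES = ["left_ear", "left_profile", "frontal", "right_profile", "right_ear"]

def fuse_scores(scores, weights=None):
    """Score-level fusion over whichever modalities are available.

    `scores` maps modality name -> match score in [0, 1]; modalities
    missing at test time are simply absent from the dict.  Weights are
    renormalized over the available modalities so the fused score stays
    comparable regardless of how many modalities were detected.
    """
    if weights is None:
        weights = {m: 1.0 for m in MODALITIES}
    avail = [m for m in MODALITIES if m in scores]
    if not avail:
        raise ValueError("no modality available")
    total = sum(weights[m] for m in avail)
    return sum(weights[m] / total * scores[m] for m in avail)

full = fuse_scores({m: 0.9 for m in MODALITIES})      # all five detected
partial = fuse_scores({"frontal": 0.9, "left_ear": 0.7})  # three missing
```

Because the weights are renormalized, a probe with only two detected modalities still yields a fused score on the same [0, 1] scale as a probe with all five.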

바이오 응용을 위한 초음파 및 광학 기반 다중 모달 영상 기술 (Ultrasound-optical imaging-based multimodal imaging technology for biomedical applications)

  • 이문환;박희연;이경수;김세웅;김지훈;황재윤
    • The Journal of the Acoustical Society of Korea / Vol. 42, No. 5 / pp.429-440 / 2023
  • This study surveys recent research trends in multimodal imaging technologies based on combined ultrasound and optical imaging and their potential applications. Ultrasound imaging provides real-time imaging and is relatively safe for the human body, so it is used in the medical field to diagnose a variety of diseases. However, ultrasound imaging suffers from limited resolution, so research is under way on multimodal imaging technologies that combine it with optical imaging to improve diagnostic accuracy. In particular, ultrasound-optical multimodal imaging maximizes the advantages and compensates for the disadvantages of each imaging modality, serving as a means of improving the accuracy of disease diagnosis. Such technologies have been proposed in various forms, including the fusion of real-time ultrasound imaging with optical coherence tomography, ultrasound-photoacoustic multimodal imaging, ultrasound-fluorescence multimodal imaging, ultrasound-fluorescence-lifetime multimodal imaging, and ultrasound-spectroscopy multimodal imaging. This study introduces recent research trends in these ultrasound-optical multimodal imaging technologies and surveys their potential applications in the medical and biomedical fields. In doing so, it provides insight into how the convergence of ultrasound and optical technologies is progressing and lays the groundwork for new approaches to improving diagnostic accuracy in the medical field.

멀티모달 인터페이스를 사용한 웹 게임 시스템의 구현 (Implementation of Web Game System using Multi Modal Interfaces)

  • 이준;안영석;김지인;박성준
    • Journal of Korea Game Society / Vol. 9, No. 6 / pp.127-137 / 2009
  • A web game is a type of game played through a web browser; it offers convenient accessibility and does not require downloading large volumes of game data. With the recent advances in mobile devices and the arrival of the Web 2.0 era, web games have a new opportunity for growth. This study proposes a new type of system that links such web games with mobile devices and with a multimodal interface that enables intuitive user control. In this paper, the Wii is used as the multimodal interface for the web game, and the system is designed so that multiple users can also play the game on ordinary PCs and on mobile devices such as UMPCs. To evaluate the proposed system, we conducted performance and user evaluations comparing conventional web game play with play using the multimodal interface. The experimental results show that when the multimodal interface was used on a mobile device, game clear times and errors decreased, and user interest was also the highest.


감정 인지를 위한 음성 및 텍스트 데이터 퓨전: 다중 모달 딥 러닝 접근법 (Speech and Textual Data Fusion for Emotion Detection: A Multimodal Deep Learning Approach)

  • 에드워드 카야디;송미화
    • Korea Information Processing Society Conference Proceedings / 2023 KIPS Fall Conference / pp.526-527 / 2023
  • Speech emotion recognition (SER) is one of the interesting topics in the machine learning field. Developing a multimodal speech emotion recognition system offers numerous benefits. This paper explains how to fuse BERT as the text recognizer and a CNN as the speech recognizer to build a multimodal SER system.
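As a rough illustration only (the fusion method is not detailed in the abstract), late fusion of the two recognizers' outputs could be sketched as a weighted average of per-class logits; the class count, weights, and all names are assumptions:

```python
import numpy as np

def late_fuse_logits(text_logits, audio_logits, w_text=0.5):
    """Weighted average of per-class emotion logits from two recognizers."""
    return w_text * text_logits + (1.0 - w_text) * audio_logits

# Stand-ins for a BERT-based text head and a CNN-based speech head,
# each emitting logits over four hypothetical emotion classes.
text_logits = np.array([2.0, 0.1, -1.0, 0.3])
audio_logits = np.array([1.5, 0.4, -0.2, 0.1])
fused = late_fuse_logits(text_logits, audio_logits)
pred = int(np.argmax(fused))  # index of the fused emotion class
```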

얼굴과 발걸음을 결합한 인식 (Fusion algorithm for Integrated Face and Gait Identification)

  • Nizami, Imran Fareed;Hong, Sug-Jun;Lee, Hee-Sung;Ann, Toh-Kar;Kim, Eun-Tai;Park, Mig-Non
    • Korean Institute of Intelligent Systems Conference Proceedings / 2007 KIIS Fall Conference / pp.15-18 / 2007
  • Identification of humans from multiple viewpoints is an important task for surveillance and security purposes. For optimal performance, the system should use the maximum information available from its sensors. Multimodal biometric systems can utilize more than one physiological or behavioral characteristic for enrollment, verification, or identification. Since gait alone is not yet established as a very distinctive feature, this paper presents an approach that fuses face and gait for identification. We consider the single-camera case, i.e., both face and gait recognition are performed on the same set of images captured by a single camera. The aim of this paper is to improve system performance by utilizing the maximum amount of information available in the images. Fusion is performed at the decision level. The proposed algorithm is tested on the NLPR database.
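Decision-level fusion, as used above, combines each recognizer's final decision rather than its features or scores. A toy sketch with an assumed agreement-then-fallback rule (not necessarily the paper's rule):

```python
def decision_fuse(face_id, gait_id, prefer="face"):
    """Decision-level fusion of two recognizers' top-1 identities.

    If face and gait agree, accept the identity; otherwise fall back
    to the modality deemed more reliable (face by default, since gait
    alone is less distinctive).
    """
    if face_id == gait_id:
        return face_id
    return face_id if prefer == "face" else gait_id

agree = decision_fuse("subject_07", "subject_07")     # both agree
disagree = decision_fuse("subject_07", "subject_12")  # falls back to face
```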


Using Keystroke Dynamics for Implicit Authentication on Smartphone

  • Do, Son;Hoang, Thang;Luong, Chuyen;Choi, Seungchan;Lee, Dokyeong;Bang, Kihyun;Choi, Deokjai
    • Journal of Korea Multimedia Society / Vol. 17, No. 8 / pp.968-976 / 2014
  • Authentication methods on smartphones should be implicit to users, requiring minimal user interaction. Existing authentication methods (e.g., PINs, passwords, and visual patterns) do not effectively address memorability and privacy issues. Behavioral biometrics such as keystroke dynamics and gait can be acquired easily and implicitly using the sensors integrated into a smartphone. We propose a biometric model involving keystroke dynamics for implicit authentication on smartphones. We first design a feature extraction method for keystroke dynamics, and then build a fusion model of keystroke dynamics and gait to improve the authentication performance over a single behavioral biometric. Fusion is performed at both the feature extraction level and the matching score level. Experiments using a linear Support Vector Machine (SVM) classifier reveal that the best results are achieved with score fusion: a recognition rate of approximately 97.86% in identification mode and an error rate of approximately 1.11% in authentication mode.
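Score-level fusion usually requires normalizing the raw matcher scores first; a minimal sketch, where the min-max ranges, the weights, and the decision threshold are all assumptions, not values from the paper:

```python
def min_max_norm(score, lo, hi):
    """Map a raw matcher score into [0, 1] before fusion."""
    return (score - lo) / (hi - lo)

def score_fuse(keystroke_score, gait_score, w_keystroke=0.4):
    """Weighted-sum score-level fusion of two behavioral biometrics."""
    return w_keystroke * keystroke_score + (1 - w_keystroke) * gait_score

ks = min_max_norm(7.0, lo=0.0, hi=10.0)    # raw keystroke-dynamics score
gt = min_max_norm(80.0, lo=0.0, hi=100.0)  # raw gait score
fused = score_fuse(ks, gt)
accepted = fused >= 0.6  # hypothetical acceptance threshold
```

Normalizing each matcher to a common scale before the weighted sum is the standard precaution when fusing scores produced by heterogeneous matchers.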

실수형 퍼지볼트를 이용한 다중 바이오인식 시스템 (Multimodal Biometric Recognition System using Real Fuzzy Vault)

  • 이대종;전명근
    • Journal of Korean Institute of Intelligent Systems / Vol. 23, No. 4 / pp.310-316 / 2013
  • Biometric recognition systems are widely used in various fields, including criminal investigation, because biometric traits are unique and unchanging. However, serious problems arise when biometric information is leaked to an illegitimate user. In this paper, to protect fingerprint and face information, we develop a multimodal biometric recognition system using a real fuzzy vault, which performs real-valued error-correcting encoding. The proposed method has two advantages: by using a real fuzzy vault, a personal key value can be changed at any time, unlike fingerprint and face features, which cannot be regenerated once compromised; and by fusing the two kinds of biometric information, a recognition system with strengthened security can be implemented. Experiments conducted to verify the validity of the proposed method show superior results compared with existing methods.
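The paper's real fuzzy vault encodes real-valued features with error-correcting codes; the toy sketch below illustrates only the general fuzzy-vault idea over small integers (lock a secret polynomial with genuine feature points plus chaff, unlock by interpolating when enough genuine points are presented) and is not the authors' scheme:

```python
from fractions import Fraction
import random

def lock(secret_coeffs, genuine_x, chaff_x):
    """Lock a secret (polynomial coefficients) in a vault.

    Genuine feature points lie on the polynomial; chaff points are
    deliberately pushed off it to hide which points are genuine.
    """
    poly = lambda x: sum(c * x**i for i, c in enumerate(secret_coeffs))
    vault = [(x, poly(x)) for x in genuine_x]
    vault += [(x, poly(x) + random.randint(1, 9)) for x in chaff_x]
    random.shuffle(vault)
    return vault

def unlock(vault, query_x, degree):
    """Recover the secret by interpolating vault points that match the query."""
    pts = [(x, y) for x, y in vault if x in set(query_x)][: degree + 1]
    if len(pts) < degree + 1:
        return None  # not enough genuine features presented
    # Lagrange interpolation with exact rational arithmetic
    coeffs = [Fraction(0)] * (degree + 1)
    for i, (xi, yi) in enumerate(pts):
        basis = [Fraction(1)]
        denom = Fraction(1)
        for j, (xj, _) in enumerate(pts):
            if j == i:
                continue
            denom *= xi - xj
            new = [Fraction(0)] * (len(basis) + 1)  # basis *= (x - xj)
            for k, c in enumerate(basis):
                new[k] -= c * xj
                new[k + 1] += c
            basis = new
        for k, c in enumerate(basis):
            coeffs[k] += Fraction(yi) * c / denom
    return [int(c) for c in coeffs]

random.seed(1)
key = [5, -2, 3]  # secret key encoded as polynomial coefficients
vault = lock(key, genuine_x=[1, 2, 3, 4], chaff_x=[10, 11, 12])
recovered = unlock(vault, query_x=[1, 2, 3], degree=2)
```

A query that presents too few genuine features (fewer than degree + 1 matching points) fails to unlock the vault, which is the property that lets the key be revoked and reissued independently of the underlying biometrics.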

Intelligent Hybrid Fusion Algorithm with Vision Patterns for Generation of Precise Digital Road Maps in Self-driving Vehicles

  • Jung, Juho;Park, Manbok;Cho, Kuk;Mun, Cheol;Ahn, Junho
    • KSII Transactions on Internet and Information Systems (TIIS) / Vol. 14, No. 10 / pp.3955-3971 / 2020
  • Due to the significant increase in the use of autonomous car technology, it is essential to integrate this technology with high-precision digital map data containing more precise and accurate roadway information than existing conventional map resources, to ensure the safety of self-driving operations. While existing map technologies may assist vehicles in identifying their locations via the Global Positioning System, it is difficult to keep these maps updated with environmental changes to roadways. Roadway vision algorithms can be useful for building autonomous vehicles that avoid accidents and detect real-time location changes. We incorporate a hybrid architectural design that combines unsupervised classification of vision data with supervised joint fusion classification to achieve a more noise-resistant algorithm. Via a deep learning approach, we identify an intelligent hybrid fusion algorithm for fusing multimodal vision feature data for roadway classification and characterize its improvement in accuracy over unsupervised identification using image processing and supervised vision classifiers. We analyzed over 93,000 vision frames collected from a test vehicle on real roadways. The performance of the proposed hybrid fusion algorithm is successfully evaluated for the generation of digital roadway maps for autonomous vehicles, with a recall of 0.94, precision of 0.96, and accuracy of 0.92.
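One way an unsupervised stage and a supervised stage could be combined is an agreement check between the cluster-implied class and the supervised prediction; this is a toy sketch with assumed names, centroids, and agreement rule, not the paper's algorithm:

```python
import numpy as np

def nearest_centroid(x, centroids):
    """Unsupervised step: assign a vision feature to its nearest cluster."""
    d = np.linalg.norm(centroids - x, axis=1)
    return int(np.argmin(d))

def hybrid_fuse(x, centroids, cluster_to_class, supervised_pred):
    """Joint fusion: keep the supervised prediction only when it agrees
    with the class implied by the unsupervised cluster; otherwise flag
    the frame as noisy for review."""
    implied = cluster_to_class[nearest_centroid(x, centroids)]
    return supervised_pred if supervised_pred == implied else "uncertain"

# Hypothetical 2-D feature space with two clusters of roadway elements.
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
cluster_to_class = {0: "lane_marking", 1: "road_sign"}

ok = hybrid_fuse(np.array([0.2, -0.1]), centroids, cluster_to_class,
                 "lane_marking")                       # both stages agree
flagged = hybrid_fuse(np.array([4.8, 5.1]), centroids, cluster_to_class,
                      "lane_marking")                  # stages disagree
```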