• Title/Summary/Keyword: Audio-visual integration


A Novel Integration Scheme for Audio Visual Speech Recognition

  • Pham, Than Trung;Kim, Jin-Young;Na, Seung-You
    • The Journal of the Acoustical Society of Korea, v.28 no.8, pp.832-842, 2009
  • Automatic speech recognition (ASR) has been successfully applied to many real human-computer interaction (HCI) applications; however, its performance degrades significantly in noisy environments. Audio-visual speech recognition (AVSR), which combines the acoustic signal with lip motion, has recently attracted attention because of its robustness to noise. In this paper, we describe a novel integration scheme for AVSR based on a late-integration approach. First, we introduce a robust reliability measure for the audio and visual modalities that uses both model-based and signal-based information: the model-based sources measure the confusability of the vocabulary, while the signal-based sources estimate the noise level. Second, the output probabilities of the audio and visual speech recognizers are normalized before the final integration step, which combines the normalized output scores with the estimated weights (a minimal sketch of this late-integration step follows below). We evaluate the proposed method on a Korean isolated-word recognition task. The experimental results demonstrate the effectiveness and feasibility of the proposed system compared with conventional systems.
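
A minimal sketch of such a reliability-weighted late-integration step, not the authors' implementation: the min-max score normalization and the single audio weight derived from the reliability estimate are assumptions; only the general form, in which normalized audio and visual scores are combined with estimated weights, follows the abstract.

```python
import numpy as np

def normalize_scores(log_scores):
    """Min-max normalize per-word log-likelihoods to [0, 1] (assumed normalization)."""
    log_scores = np.asarray(log_scores, dtype=float)
    lo, hi = log_scores.min(), log_scores.max()
    if hi == lo:
        return np.ones_like(log_scores)
    return (log_scores - lo) / (hi - lo)

def late_integration(audio_log_scores, visual_log_scores, audio_weight):
    """Combine normalized audio/visual scores with a reliability weight in [0, 1].

    audio_log_scores, visual_log_scores: per-vocabulary-word log-likelihoods
    audio_weight: estimated reliability of the audio stream (e.g., from noise level)
    Returns the index of the recognized word.
    """
    a = normalize_scores(audio_log_scores)
    v = normalize_scores(visual_log_scores)
    combined = audio_weight * a + (1.0 - audio_weight) * v
    return int(np.argmax(combined))

# Hypothetical example: 3-word vocabulary, audio judged fairly reliable.
audio = [-120.0, -95.0, -110.0]
visual = [-60.0, -75.0, -58.0]
print(late_integration(audio, visual, audio_weight=0.7))
```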

Human-Robot Interaction in Real Environments by Audio-Visual Integration

  • Kim, Hyun-Don;Choi, Jong-Suk;Kim, Mun-Sang
    • International Journal of Control, Automation, and Systems, v.5 no.1, pp.61-69, 2007
  • In this paper, we develop a reliable sound localization system with a VAD (Voice Activity Detection) component using three microphones, together with a face tracking system using a vision camera. We then propose a way to integrate these systems for human-robot interaction, compensating for errors in speaker localization and effectively rejecting speech or noise signals arriving from undesired directions. To verify the system's performance, we installed the proposed audio-visual system on a prototype robot called IROBAA (Intelligent ROBot for Active Audition) and demonstrated the audio-visual integration (a sketch of a microphone-pair localization step is given below).
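
A minimal sketch of one building block such a system needs: estimating the direction of a speaker from the time difference of arrival (TDOA) between two of the microphones. The sampling rate, microphone spacing, and use of a plain cross-correlation peak are assumptions, not the paper's actual localization method.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def doa_from_pair(sig_a, sig_b, fs, mic_distance):
    """Estimate direction of arrival (degrees) for one microphone pair.

    Uses the lag of the cross-correlation peak as the TDOA, then converts
    it to an angle relative to the broadside of the pair.
    """
    corr = np.correlate(sig_a, sig_b, mode="full")
    lag = np.argmax(corr) - (len(sig_b) - 1)      # delay in samples
    tdoa = lag / fs                                # delay in seconds
    # Clamp to the physically possible range before taking arcsin.
    sin_theta = np.clip(tdoa * SPEED_OF_SOUND / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_theta))

# Hypothetical example: a 3-sample inter-microphone delay at 16 kHz, 10 cm spacing.
fs = 16000
t = np.arange(0, 0.02, 1.0 / fs)
src = np.sin(2 * np.pi * 500 * t)
delayed = np.roll(src, 3)                          # simulate the delayed microphone
print(round(doa_from_pair(delayed, src, fs, mic_distance=0.10), 1))  # ~40 degrees
```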

Constructing a Noise-Robust Speech Recognition System using Acoustic and Visual Information (청각 및 시각 정보를 이용한 강인한 음성 인식 시스템의 구현)

  • Lee, Jong-Seok;Park, Cheol-Hoon
    • Journal of Institute of Control, Robotics and Systems, v.13 no.8, pp.719-725, 2007
  • In this paper, we present an audio-visual speech recognition system for noise-robust human-computer interaction. Unlike conventional speech recognition systems, our system uses the visual signal containing the speaker's lip movements along with the acoustic signal to achieve robust recognition performance in environmental noise. The procedures for acoustic speech processing, visual speech processing, and audio-visual integration are described in detail. Experimental results demonstrate that, by exploiting the complementary nature of the two signals, the constructed system significantly improves recognition performance in noisy conditions compared with acoustic-only recognition (a sketch of an SNR-based stream-weighting rule is given below).
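
A minimal sketch of one common way to set the audio-visual integration weight: deriving an audio stream weight from an estimated SNR. The rough SNR estimator and the linear SNR-to-weight ramp are assumptions for illustration, not the method described in the paper.

```python
import numpy as np

def estimate_snr_db(speech_frame, noise_frame):
    """Rough SNR estimate (dB) from a noisy speech frame and a noise-only frame."""
    speech_power = np.mean(np.square(speech_frame))
    noise_power = np.mean(np.square(noise_frame)) + 1e-12
    return 10.0 * np.log10(speech_power / noise_power + 1e-12)

def audio_stream_weight(snr_db, low_db=0.0, high_db=20.0):
    """Map SNR to an audio weight in [0, 1] with a linear ramp (assumed mapping).

    Below low_db the visual stream dominates; above high_db the audio stream
    dominates; in between the weight grows linearly with the SNR.
    """
    w = (snr_db - low_db) / (high_db - low_db)
    return float(np.clip(w, 0.0, 1.0))

# Hypothetical example: white noise versus a noisy sinusoid at 16 kHz.
rng = np.random.default_rng(0)
noise = 0.1 * rng.standard_normal(16000)
speech = np.sin(2 * np.pi * 300 * np.arange(16000) / 16000) + noise
snr = estimate_snr_db(speech, noise)
print(round(snr, 1), round(audio_stream_weight(snr), 2))
```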

Audio-Visual Content Analysis Based Clustering for Unsupervised Debate Indexing (비교사 토론 인덱싱을 위한 시청각 콘텐츠 분석 기반 클러스터링)

  • Keum, Ji-Soo;Lee, Hyon-Soo
    • The Journal of the Acoustical Society of Korea, v.27 no.5, pp.244-251, 2008
  • In this research, we propose an unsupervised debate indexing method that uses audio and visual information. The proposed method combines speech clustering results obtained with the Bayesian Information Criterion (BIC) and visual clustering results obtained with a distance function. Combining the two modalities reduces the problems that arise when speech or visual information is used alone and enables effective content-based analysis. We performed experiments on five types of debate data to evaluate the method with and without audio-visual integration. The experimental results show that audio-visual integration outperforms the individual use of speech or visual information for debate indexing (a sketch of a BIC-based change test is given below).
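
A minimal sketch of the ΔBIC test that typically underlies this kind of speech clustering: it compares modeling two feature segments jointly versus separately under full-covariance Gaussians, so a positive score favors a split. The feature type (e.g., MFCCs) and the penalty weight are assumptions; the paper's exact clustering procedure is not reproduced here.

```python
import numpy as np

def delta_bic(features_1, features_2, penalty_lambda=1.0):
    """ΔBIC test for whether two feature segments come from different sources.

    features_1, features_2: arrays of shape (n_frames, n_dims), e.g. MFCCs.
    A positive ΔBIC favors modeling the segments separately (a change point).
    """
    x = np.vstack([features_1, features_2])
    n, d = x.shape

    def half_n_logdet_cov(data):
        cov = np.cov(data, rowvar=False)
        sign, logdet = np.linalg.slogdet(cov)
        return 0.5 * len(data) * logdet

    penalty = 0.5 * penalty_lambda * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (half_n_logdet_cov(x)
            - half_n_logdet_cov(features_1)
            - half_n_logdet_cov(features_2)
            - penalty)

# Hypothetical example: two synthetic "speakers" with different feature means.
rng = np.random.default_rng(1)
seg_a = rng.normal(0.0, 1.0, size=(200, 12))
seg_b = rng.normal(2.0, 1.0, size=(200, 12))
print(delta_bic(seg_a, seg_b) > 0)   # True: segments look like different sources
print(delta_bic(seg_a, seg_a) > 0)   # False: same segment twice, no change
```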

A 3D Audio-Visual Animated Agent for Expressive Conversational Question Answering

  • Martin, J.C.;Jacquemin, C.;Pointal, L.;Katz, B.
    • 한국정보컨버전스학회 학술대회논문집 (Korea Information Convergence Society conference proceedings), 2008.06a, pp.53-56, 2008
  • This paper reports on the ACQA (Animated agent for Conversational Question Answering) project conducted at LIMSI. The aim is to design an expressive animated conversational agent (ACA) for conducting research along two main lines: 1) perceptual experiments (e.g., perception of expressivity and 3D movements in both the audio and visual channels); 2) design of human-computer interfaces requiring head models at different resolutions and the integration of the talking head in virtual scenes. The target application of this expressive ACA is RITEL, a real-time speech-based question-answering system developed at LIMSI. The architecture of the system is based on distributed modules exchanging messages through a network protocol. The main components of the system are: RITEL, a question-answering system searching raw text, which produces a text answer and attitudinal information; the attitudinal information is then processed to deliver expressive tags, and the text is converted into phoneme, viseme, and prosodic descriptions. Audio speech is generated by the LIMSI selection-concatenation text-to-speech engine. Visual speech uses MPEG-4 keypoint-based animation and is rendered in real time by Virtual Choreographer (VirChor), a GPU-based 3D engine. Finally, visual and audio speech is played in a 3D audio and visual scene. The project also devotes considerable effort to realistic visual and audio 3D rendering: a new model of phoneme-dependent human radiation patterns is included in the speech synthesis system, so that the ACA can move in the virtual scene with realistic 3D visual and audio rendering (a sketch of a phoneme-to-viseme mapping step is given below).
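
A minimal sketch of the phoneme-to-viseme conversion step mentioned above, mapping a timed phoneme sequence onto viseme keyframes for a keypoint-based talking head. The phoneme set, viseme classes, and frame duration are illustrative assumptions, not the ACQA project's actual tables.

```python
# Hypothetical many-to-one phoneme -> viseme classes (MPEG-4 defines a small
# fixed viseme set; the groups below are illustrative only).
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "a": "open_vowel", "e": "mid_vowel", "i": "spread_vowel",
    "o": "rounded_vowel", "u": "rounded_vowel",
    "sil": "neutral",
}

def visemes_from_phonemes(phonemes, frame_ms=40):
    """Map timed phonemes [(phoneme, duration_ms), ...] to (time_ms, viseme) keyframes."""
    keyframes, t = [], 0
    for phoneme, duration_ms in phonemes:
        viseme = PHONEME_TO_VISEME.get(phoneme, "neutral")
        for _ in range(max(1, duration_ms // frame_ms)):
            keyframes.append((t, viseme))
            t += frame_ms
    return keyframes

# Hypothetical example: a toy phoneme sequence with durations in milliseconds.
print(visemes_from_phonemes([("m", 80), ("o", 120), ("b", 60), ("i", 120)]))
```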


Comparison of Integration Methods of Speech and Lip Information in the Bi-modal Speech Recognition (바이모달 음성인식의 음성정보와 입술정보 결합방법 비교)

  • 박병구;김진영;최승호
    • The Journal of the Acoustical Society of Korea, v.18 no.4, pp.31-37, 1999
  • Bimodal speech recognition using visual and audio information has been proposed and studied to improve the performance of ASR (Automatic Speech Recognition) systems in noisy environments. Integration methods for the two modalities are usually classified into early integration and late integration. The early integration methods include one using a fixed weight for the lip parameters and one using a variable weight according to the speech SNR. The four late integration methods are: using audio and visual information independently, using the speech optimal path, using the lip optimal path, and using speech SNR information. Among these six methods, the method using a fixed weight for the lip parameters showed the best recognition rate (a sketch of this early, fixed-weight feature fusion is given below).
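
A minimal sketch of the early, fixed-weight integration that performed best in this comparison: the lip parameters are scaled by a fixed weight and concatenated with the acoustic features into one observation vector per frame. The feature dimensions and the weight value are assumptions, not the paper's settings.

```python
import numpy as np

def early_integration(audio_features, lip_features, lip_weight=0.3):
    """Early (feature-level) integration with a fixed lip-parameter weight.

    audio_features: (n_frames, n_audio_dims), e.g. MFCC vectors
    lip_features:   (n_frames, n_lip_dims), e.g. lip width/height parameters
    The streams are assumed frame-synchronous; the fixed weight scales the
    lip parameters before concatenation into a single observation vector.
    """
    audio_features = np.asarray(audio_features, dtype=float)
    lip_features = np.asarray(lip_features, dtype=float)
    if len(audio_features) != len(lip_features):
        raise ValueError("streams must be frame-synchronous")
    return np.hstack([audio_features, lip_weight * lip_features])

# Hypothetical example: 100 frames of 12-dim MFCCs and 4 lip parameters.
rng = np.random.default_rng(2)
fused = early_integration(rng.normal(size=(100, 12)), rng.normal(size=(100, 4)))
print(fused.shape)  # (100, 16): one combined feature vector per frame
```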


The Influence of SOA between the Visual and Auditory Stimuli with Semantic Properties on Integration of Audio-Visual Senses -Focus on the Redundant Target Effect and Visual Dominance Effect- (의미적 속성을 가진 시.청각자극의 SOA가 시청각 통합 현상에 미치는 영향 -중복 표적 효과와 시각 우세성 효과를 중심으로-)

  • Kim, Bo-Seong;Lee, Young-Chang;Lim, Dong-Hoon;Kim, Hyun-Woo;Min, Yoon-Ki
    • Science of Emotion and Sensibility, v.13 no.3, pp.475-484, 2010
  • This study examined the influence of the SOA (stimulus onset asynchrony) between visual and auditory stimuli on the integration of audio-visual senses. Within this integration phenomenon, we examined the redundant target effect (a faster and more accurate response when the target is presented in more than one modality) and the visual dominance effect (a faster and more accurate response to a visual stimulus than to an auditory stimulus), using visual and auditory unimodal target conditions and a multimodal target condition and measuring response time and accuracy. The redundant target effect was present despite changes in the SOA between the visual and auditory stimuli, while an auditory dominance effect appeared when the SOA between the two stimuli exceeded 100 ms. These results imply that the redundant target effect is maintained even when the SOA between the two modal stimuli is altered, and suggest that behavioral evidence of superior processing of one modality emerges only when the onsets of the auditory and visual stimuli differ by roughly 100 ms or more (a sketch of the redundancy-gain computation is given below).
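
A minimal sketch of how a redundant-target gain can be quantified from such data: the mean bimodal reaction time is compared against the faster of the two unimodal means. The reaction times below are hypothetical, and this is not the authors' analysis script.

```python
import statistics

def redundancy_gain(rt_visual, rt_auditory, rt_bimodal):
    """Redundant-target gain in ms: how much faster the bimodal condition is
    than the faster of the two unimodal conditions (by mean reaction time)."""
    mean_v = statistics.mean(rt_visual)
    mean_a = statistics.mean(rt_auditory)
    mean_av = statistics.mean(rt_bimodal)
    return min(mean_v, mean_a) - mean_av

# Hypothetical reaction times (ms) for one participant at one SOA level.
visual = [420, 455, 430, 448]
auditory = [390, 410, 405, 398]
bimodal = [365, 372, 380, 360]
print(round(redundancy_gain(visual, auditory, bimodal), 1))  # positive gain
```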


Comparison of McGurk Effect across Three Consonant-Vowel Combinations in Kannada

  • Devaraju, Dhatri S;U, Ajith Kumar;Maruthy, Santosh
    • Journal of Audiology & Otology, v.23 no.1, pp.39-48, 2019
  • Background and Objectives: The influence of the visual stimulus on the auditory component in the perception of auditory-visual (AV) consonant-vowel syllables has been demonstrated in different languages, and inherent properties of the unimodal stimuli are known to modulate AV integration. The present study investigated how the amount of the McGurk effect (an outcome of AV integration) varies across three consonant combinations in the Kannada language, and also examined the influence of unimodal syllable identification on the amount of the McGurk effect. Subjects and Methods: Twenty-eight individuals performed an AV identification task with ba/ga, pa/ka, and ma/ṇa consonant combinations in AV congruent, AV incongruent (McGurk combination), audio-alone, and visual-alone conditions. Cluster analysis of the identification scores for the incongruent stimuli was used to classify the individuals into two groups, one with high and the other with low McGurk scores, and the audio-alone and visual-alone scores of the two groups were compared. Results: McGurk scores were significantly higher for ma/ṇa than for the ba/ga and pa/ka combinations in both the high- and low-McGurk-score groups, with no significant difference between ba/ga and pa/ka in either group. Identification of /ṇa/ in the visual-alone condition correlated negatively with higher McGurk scores. Conclusions: The results suggest that the final percept following AV integration is not exclusively explained by unimodal identification of the syllables; other factors may also contribute to the final percept (a sketch of the two-group clustering step is given below).
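
A minimal sketch of the two-group analysis described above, using hypothetical scores: participants are split into high- and low-McGurk groups by clustering their incongruent-stimulus scores, and the visual-alone /ṇa/ identification scores are correlated with the McGurk scores. K-means (scikit-learn) and a Pearson correlation (SciPy) are assumptions, since the abstract specifies only "cluster analysis"; the negative relation in the synthetic data merely mirrors the reported finding for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.stats import pearsonr

# Hypothetical per-participant proportions: McGurk responses to incongruent
# stimuli and /ṇa/ identification in the visual-alone condition (synthetic).
rng = np.random.default_rng(3)
mcgurk_scores = np.concatenate([rng.uniform(0.6, 0.9, 14), rng.uniform(0.1, 0.4, 14)])
visual_na_scores = 1.0 - mcgurk_scores + rng.normal(0.0, 0.05, 28)

# Two-group split of participants by their McGurk scores (assumed k-means).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    mcgurk_scores.reshape(-1, 1))
high_group = labels == labels[np.argmax(mcgurk_scores)]
print("high-group mean McGurk score:", round(mcgurk_scores[high_group].mean(), 2))
print("low-group mean McGurk score:", round(mcgurk_scores[~high_group].mean(), 2))

# Correlation between visual-alone /ṇa/ identification and McGurk scores.
r, p = pearsonr(visual_na_scores, mcgurk_scores)
print("r =", round(r, 2), "p =", round(p, 4))
```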
