[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.5302/J.ICROS.2007.13.8.719

Constructing a Noise-Robust Speech Recognition System using Acoustic and Visual Information

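Lee, Jong-Seok (Department of Electrical and Electronic Engineering, School of Electrical Engineering and Computer Science, Korea Advanced Institute of Science and Technology)
Park, Cheol-Hoon (Department of Electrical and Electronic Engineering, School of Electrical Engineering and Computer Science, Korea Advanced Institute of Science and Technology)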

Publication Information

Journal of Institute of Control, Robotics and Systems / v.13, no.8, 2007, pp. 719-725

Abstract

In this paper, we present an audio-visual speech recognition system for noise-robust human-computer interaction. Unlike conventional speech recognition systems, our system uses the visual signal containing the speaker's lip movements along with the acoustic signal to achieve speech recognition that is robust to environmental noise. The procedures for acoustic speech processing, visual speech processing, and audio-visual integration are described in detail. Experimental results demonstrate that, by exploiting the complementary nature of the two signals, the constructed system significantly improves recognition performance in noisy conditions compared to acoustic-only recognition.

Keywords

audio-visual speech recognition; noise-robustness; integration

Citations & Related Records

Reference

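1	J.-S. Lee, S.-H. Shim, S.-Y. Kim, and C. H. Park, 'Bimodal speech recognition using robust feature extraction of lip movements under uncontrolled illumination conditions,' Telecommunications Review, vol. 14, no. 1, pp. 123-134, Feb. 2004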
2	T. W. Lewis and D. M. W. Powers, 'Sensor fusion weighting measures in audio-visual speech recognition,' in Proc. Conf. Australasian Computer Science, pp. 305-314, 2004
3	C. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, UK, 1995
4	A. Varga and H. J. M. Steeneken, 'Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems,' Speech Communication, vol. 12, no. 3, pp. 247-251, 1993
5	L. A. Ross, D. Saint-Amour, V. M. Leavitt, D. C. Javitt, and J. J. Foxe, 'Do you see what I am saying? Exploring visual enhancement of speech comprehension in noisy environments,' Cerebral Cortex, vol. 17, no. 5, pp. 1147-1153, 2007
6	C. C. Chibelushi, F. Deravi, and J. S. D. Mason, 'A review of speech-based bimodal recognition,' IEEE Trans. Multimedia, vol. 4, no. 1, pp. 23-37, Mar. 2002
7	X.-D. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice Hall, 2001
8	P. Scanlon and R. Reilly, 'Feature analysis for automatic speechreading,' in Proc. Int. Conf. Multimedia and Expo, pp. 625-630, 2001
9	J.-S. Lee and C. H. Park, 'Training hidden Markov models by hybrid simulated annealing for visual speech recognition,' in Proc. Int. Conf. Systems, Man, Cybernetics, pp. 198-202, Oct. 2006
10	C. Benoit, 'The intrinsic bimodality of speech communication and the synthesis of talking faces,' in The Structure of Multimodal Dialogue II, M. M. Taylor, F. Nel, and D. Bouwhuis, Eds. Amsterdam, The Netherlands: John Benjamins, pp. 485-502, 2000
11	R. C. Gonzalez and R. E. Woods, Digital Image Processing, Addison-Wesley Publishing Company, 2001
12	T. J. Hazen, 'Visual model structures and synchrony constraints for audio-visual speech recognition,' IEEE Trans. Audio, Speech, Language Processing, vol. 14, no. 3, pp. 1082-1089, May 2006
13	A. Verma, T. Faruquie, C. Neti, and S. Basu, 'Late integration in audio-visual continuous speech recognition,' in Proc. Workshop on Automatic Speech Recognition and Understanding, pp. 71-74, Dec. 1999
14	G. F. Meyer, J. B. Mulligan, and S. M. Wuerger, 'Continuous audio-visual digit recognition using N-best decision fusion,' Information Fusion, vol. 5, no. 2, pp. 91-101, June 2004
15	S. Tamura, K. Iwano, and S. Furui, 'A stream-weight optimization method for multi-stream HMMs based on likelihood value normalization,' in Proc. Int. Conf. Acoustics, Speech and Signal Processing, vol. 1, pp. 469-472, Mar. 2005
