http://dx.doi.org/10.7776/ASK.2009.28.8.832

A Novel Integration Scheme for Audio Visual Speech Recognition  

Pham, Than Trung (School of Electronics & Computer Engineering Chonnam National University)
Kim, Jin-Young (School of Electronics & Computer Engineering Chonnam National University)
Na, Seung-You (School of Electronics & Computer Engineering Chonnam National University)
Abstract
Automatic speech recognition (ASR) has been successfully applied to many real-world human-computer interaction (HCI) applications; however, its performance degrades significantly in noisy environments. Audio-visual speech recognition (AVSR), which combines the acoustic signal with lip motion, has recently attracted attention for its robustness to noise. In this paper, we describe a novel integration scheme for AVSR based on a late-integration approach. First, we introduce a robust reliability measure for the audio and visual modalities that combines model-based and signal-based information: the model-based source measures the confusability of the vocabulary, while the signal-based source estimates the noise level. Second, the output probabilities of the audio and visual speech recognizers are each normalized before the final integration step, which combines the normalized output spaces using the estimated weights. We evaluate the proposed method on a Korean isolated-word recognition task. The experimental results demonstrate the effectiveness and feasibility of the proposed system compared with conventional systems.
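The late-integration idea described in the abstract can be illustrated with a minimal sketch. The code below assumes softmax normalization of per-word log-likelihoods and a simple log-linear combination governed by a single audio weight; the paper's actual normalization scheme and reliability-based weight estimation are more elaborate, and all function names and scores here are hypothetical.

```python
import math

def normalize_scores(log_likelihoods):
    """Softmax-normalize per-word log-likelihoods so both modalities
    share a comparable output space (illustrative normalization only;
    the paper's exact scheme may differ)."""
    m = max(log_likelihoods.values())
    exps = {w: math.exp(s - m) for w, s in log_likelihoods.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

def late_integration(audio_scores, visual_scores, audio_weight):
    """Weighted log-linear fusion of normalized modality scores.
    In the paper, audio_weight would come from the reliability
    estimate (vocabulary confusability plus estimated noise level)."""
    pa = normalize_scores(audio_scores)
    pv = normalize_scores(visual_scores)
    fused = {w: audio_weight * math.log(pa[w])
                + (1.0 - audio_weight) * math.log(pv[w])
             for w in pa}
    return max(fused, key=fused.get)

# Toy three-word vocabulary with hypothetical recognizer scores.
audio = {"yes": -10.0, "no": -12.0, "stop": -15.0}
visual = {"yes": -8.0, "no": -7.5, "stop": -11.0}
word = late_integration(audio, visual, audio_weight=0.7)
# When the audio stream is trusted (weight 0.7), its top word wins;
# at weight 0.0 the decision falls back to the visual stream alone.
```

Lowering the audio weight toward zero (e.g., when the SNR estimate indicates heavy noise) shifts the decision to the visual recognizer, which is the behavior a reliability-driven late-integration scheme is designed to produce.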
Keywords
Audio Visual Speech Recognition; Reliability; Late Integration; Hidden Markov Model
Citations & Related Records
  • Reference
1 T. Chen, "Audiovisual speech processing," IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 9-21, 2001
2 J.-S. Lee and C. H. Park, "Adaptive Decision Fusion for Audio-Visual Speech Recognition," in Speech Recognition, Technology and Applications, I-Tech, Vienna, Austria, pp. 275-296, 2008
3 H. Printz and P. A. Olsen, "Theory and Practice of Acoustic Confusability," Proceedings of the ISCA ITRW ASR2000, Paris, France, pp. 77-84, Sep. 18-20, 2000
4 J.-Y. Chen, P. Olsen, and J. Hershey, "Word Confusability - Measuring Hidden Markov Model Similarity," Proceedings of Interspeech 2007, pp. 2089-2092, August 2007
5 http://www.speech.cs.cmu.edu/comp.speech/Section1/Data/noisex.html
6 J. R. Hershey, P. A. Olsen, and S. J. Rennie, "Variational Kullback-Leibler Divergence for Hidden Markov Models," Proceedings of ASRU, Kyoto, Japan, pp. 323-328, December 2007
7 J. Silva and S. Narayanan, "Average Divergence Distance as a Statistical Discrimination Measure for Hidden Markov Models," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 3, pp. 890-906, May 2006
8 J. Kennedy and R. Eberhart, "Particle Swarm Optimization," Proceedings of the IEEE Int. Conf. on Neural Networks, Piscataway, NJ, pp. 1942-1948, 1995
9 J. Hershey and P. Olsen, "Approximating the Kullback-Leibler divergence between Gaussian mixture models," Proceedings of ICASSP 2007, Honolulu, Hawaii, April 2007
10 A. Nefian, L. Liang, X. Pi, L. Xiaoxiang, C. Mao, and K. Murphy, "Dynamic Bayesian Networks for Audio-Visual Speech Recognition," EURASIP Journal on Applied Signal Processing, vol. 1, pp. 1274-1288, 2002
11 E. D. Petajan, "Automatic Lipreading to Enhance Speech Recognition," Proceedings of IEEE Conf. on Computer Vision and Pattern Recognition, pp. 40-47, 1985
12 F. Berthommier and H. Glotin, "A new SNR-feature mapping for robust multistream speech recognition," Proceedings of the International Congress of Phonetic Sciences (ICPhS), vol. 1, pp. 711-715, San Francisco, 1999
13 H. Glotin, D. Vergyri, C. Neti, G. Potamianos, and J. Luettin, "Weighting schemes for audio-visual fusion in speech recognition," Proceedings of IEEE Int. Conf. Acoust., Speech, Signal Process., vol. 1, pp. 173-176, 2001
14 P. Duchnowski, U. Meier, and A. Waibel, "See Me, Hear Me: Integrating Automatic Speech Recognition and Lipreading," Proceedings of ICSLP, pp. 547-550, 1994
15 A. Bhattacharyya, "On a Measure of Divergence between Two Statistical Populations Defined by Probability Distributions," Bull. Calcutta Math. Soc., vol. 35, pp. 99-109, 1943
16 A. Rogozan, P. Deléglise, and M. Alissali, "Adaptive determination of audio and visual weights for automatic speech recognition," Proceedings of the European Tutorial Workshop on Audio-Visual Speech Processing (AVSP), pp. 61-64, 1997
17 J. R. Hershey and P. A. Olsen, "Variational Bhattacharyya Divergence for Hidden Markov Models," Proceedings of ICASSP 2008, pp. 4557-4560, 2008
18 Md. J. Alam, Md. F. Chowdhury, and Md. F. Alam, "Comparative Study of A Priori Signal-to-Noise Ratio (SNR) Estimation Approaches for Speech Enhancement," Journal of Electrical & Electronics Engineering, vol. 9, no. 1, pp. 809-817, 2009
19 M. Heckmann, F. Berthommier, and K. Kroschel, "Noise Adaptive Stream Weighting in Audio-Visual Speech Recognition," EURASIP Journal on Applied Signal Processing, vol. 2002, no. 1, pp. 1260-1273, 2002
20 G. Potamianos, C. Neti, J. Luettin, and I. Matthews, "Audio-Visual Automatic Speech Recognition: An Overview," in Issues in Visual and Audio-Visual Speech Processing, G. Bailly, E. Vatikiotis-Bateson, and P. Perrier (Eds.), MIT Press, Boston, 2004
21 M. Gurban and J.-Ph. Thiran, "Using Entropy as a Stream Reliability Estimate for Audio-Visual Speech Recognition," Proceedings of the 16th European Signal Processing Conference, Lausanne, Switzerland, August 25-29, 2008
22 S. Kullback and R. A. Leibler, "On Information and Sufficiency," The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79-86, 1951