
A Novel Integration Scheme for Audio Visual Speech Recognition

  • Pham, Than Trung (School of Electronics & Computer Engineering Chonnam National University) ;
  • Kim, Jin-Young (School of Electronics & Computer Engineering Chonnam National University) ;
  • Na, Seung-You (School of Electronics & Computer Engineering Chonnam National University)
  • Published : 2009.11.30

Abstract

Automatic speech recognition (ASR) has been successfully applied in many real human-computer interaction (HCI) applications; however, its performance degrades significantly in noisy environments. Audio-visual speech recognition (AVSR), which combines the acoustic signal with lip motion, has recently attracted increasing attention because of its robustness to noise. In this paper, we describe a novel integration scheme for AVSR based on a late integration approach. First, we introduce a robust reliability measure for the audio and visual modalities that uses both model-based and signal-based information: the model-based sources measure the confusability of the vocabulary, while the signal-based source estimates the noise level. Second, the output probabilities of the audio and visual speech recognizers are each normalized before the final integration step, which combines the normalized output scores with the estimated weights. We evaluate the proposed method on a Korean isolated-word recognition task. The experimental results demonstrate the effectiveness and feasibility of the proposed system compared with conventional systems.
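The late integration described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each recognizer outputs per-word log-likelihoods, normalizes each stream into a posterior-like distribution (one common normalization choice; the paper's exact normalization is not given in the abstract), and combines the streams with a single reliability weight for the audio modality.

```python
import math

def softmax_normalize(log_scores):
    """Map raw per-word log-likelihoods to a normalized distribution.

    Subtracting the max first keeps the exponentials numerically stable.
    """
    m = max(log_scores.values())
    exps = {w: math.exp(s - m) for w, s in log_scores.items()}
    total = sum(exps.values())
    return {w: v / total for w, v in exps.items()}

def late_fusion(audio_logs, visual_logs, audio_weight):
    """Weighted late integration of audio and visual recognizer outputs.

    audio_weight in [0, 1] reflects the estimated reliability of the
    audio stream (e.g. high at high SNR); 1 - audio_weight goes to the
    visual stream. Returns the word with the highest fused score.
    """
    pa = softmax_normalize(audio_logs)
    pv = softmax_normalize(visual_logs)
    fused = {
        w: audio_weight * math.log(pa[w]) + (1 - audio_weight) * math.log(pv[w])
        for w in pa
    }
    return max(fused, key=fused.get)

# Hypothetical two-word vocabulary: the streams disagree, so the
# reliability weight decides which modality dominates.
audio = {"yes": -10.0, "no": -12.0}   # audio favors "yes"
visual = {"yes": -9.0, "no": -8.0}    # visual favors "no"
print(late_fusion(audio, visual, 0.9))  # clean audio: trust audio
print(late_fusion(audio, visual, 0.1))  # noisy audio: trust lips
```

In the paper's scheme the weight itself is derived from the model-based confusability and signal-based noise-level estimates; here it is simply passed in as a fixed parameter.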
