A New Temporal Filtering Method for Improved Automatic Lipreading

  • Jong-Seok Lee (School of Electrical Engineering and Computer Science, KAIST) ;
  • Cheol Hoon Park (School of Electrical Engineering and Computer Science, KAIST)
  • Published: 2008.04.30

Abstract

Automatic lipreading is the recognition of speech from the movement of a speaker's lips. It has recently received attention as a way to compensate for the performance degradation of acoustic speech recognition in noisy environments. One of the important issues in automatic lipreading is defining and extracting salient features from the recorded images. In this paper, we propose a feature extraction method based on a new filtering technique that yields improved recognition performance. The proposed method applies a band-pass filter to the temporal trajectory of each pixel in the lip-region images, eliminating frequency components that are too slow or too fast to carry speech-related information; features are then extracted by principal component analysis. Speaker-independent recognition experiments show that the proposed method improves performance in both clean and visually noisy conditions.

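As a rough illustration of the pipeline described in the abstract (band-pass filtering of each pixel's temporal trajectory, followed by principal component analysis), the following Python sketch shows one way the two stages could be composed. The Butterworth design, the passband edges `low_hz`/`high_hz`, the frame rate, and the feature dimension are all illustrative assumptions; the abstract does not specify the paper's actual filter or parameter choices.

```python
import numpy as np
from scipy.signal import butter, filtfilt
from sklearn.decomposition import PCA

def extract_lip_features(frames, fps=30.0, low_hz=0.5, high_hz=9.0,
                         n_components=20):
    """Band-pass filter each pixel's temporal trajectory, then apply PCA.

    frames: array of shape (T, H, W) holding T grayscale lip-region images.
    Returns an array of shape (T, n_components), one feature vector per frame.
    All parameter defaults are illustrative, not taken from the paper.
    """
    T, H, W = frames.shape
    # View the sequence as H*W pixel trajectories, each of length T.
    traj = frames.reshape(T, H * W).astype(float)

    # 4th-order Butterworth band-pass; cutoffs normalized by the Nyquist rate.
    nyq = fps / 2.0
    b, a = butter(4, [low_hz / nyq, high_hz / nyq], btype="band")

    # Zero-phase filtering along the time axis removes components that vary
    # too slowly or too quickly to carry speech-related lip motion, without
    # shifting the remaining motion in time. (T must exceed filtfilt's
    # default padding length for this filter order.)
    filtered = filtfilt(b, a, traj, axis=0)

    # PCA over the filtered pixel vectors gives a compact per-frame feature.
    pca = PCA(n_components=n_components)
    return pca.fit_transform(filtered)
```

In practice the PCA basis would be learned once on training data and reused at test time (fit on training frames, transform elsewhere); fitting per utterance, as in this self-contained sketch, is only for brevity.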

References

  1. C. C. Chibelushi, F. Deravi, and J. S. D. Mason, “A review of speech-based bimodal recognition,” IEEE Trans. Multimedia, Vol. 4, No. 1, pp. 23-37, 2002 https://doi.org/10.1109/6046.985551
  2. H. Yao, W. Gao, W. Shan, and M. Xu, “Visual features extracting and selecting for lipreading,” in Proc. Int. Conf. Audio- and Video-based Biometric Person Authentication, Guildford, UK, pp. 251-259, Jun. 2003
  3. J.-S. Lee, S.-H. Shim, S.-Y. Kim, and C. H. Park, “Bimodal speech recognition using robust feature extraction of lip movement under uncontrolled illumination conditions,” Telecommunications Review, Vol. 14, No. 1, pp. 123-134, Feb. 2004
  4. C. Bregler and Y. Konig, “Eigenlips for robust speech recognition,” in Proc. Int. Conf. Acoustics, Speech and Signal Processing, Vol. 2, Adelaide, Australia, pp. 669-672, 1994
  5. G. Potamianos, A. Verma, C. Neti, G. Iyengar, and S. Basu, “A cascade image transform for speaker independent automatic speechreading,” in Proc. Int. Conf. Multimedia and Expo, Vol. 2, New York, pp. 1097-1100, 2000
  6. P. Scanlon and R. Reilly, “Feature analysis for automatic speechreading,” in Proc. Int. Conf. Multimedia and Expo, Tokyo, Japan, pp. 625-630, Apr. 2001
  7. G. Potamianos and C. Neti, “Audio-visual speech recognition in challenging environments,” in Proc. Eurospeech, Geneva, Switzerland, pp. 1293-1296, Sep. 2003
  8. K. Saenko, T. Darrell, and J. Glass, “Articulatory features for robust visual speech recognition,” in Proc. Int. Conf. Multimodal Interfaces, State College, PA, pp. 152-158, Oct. 2004
  9. A. Amer and E. Dubois, “Fast and reliable structure-oriented video noise estimation,” IEEE Trans. Circuits and Systems for Video Technology, Vol. 15, No. 1, pp. 113-118, Jan. 2005 https://doi.org/10.1109/TCSVT.2004.837017
  10. J.-S. Lee and C. H. Park, “Information fusion for audio-visual speech recognition: a comparison of reliability measures and an integration method using neural networks,” Telecommunications Review, Vol. 17, No. 3, pp. 538-550, Jun. 2007
  11. J.-S. Lee and C. H. Park, “Training hidden Markov models by hybrid simulated annealing for visual speech recognition,” in Proc. Int. Conf. Systems, Man, and Cybernetics, Taipei, Taiwan, pp. 198-202, Oct. 2006
  12. R. C. Gonzalez and R. E. Woods, “Digital Image Processing,” Prentice-Hall, Upper Saddle River, NJ, 2001
  13. S. Lucey, “An evaluation of visual speech features for the tasks of speech and speaker recognition,” in Proc. Int. Conf. Audio- and Video-based Biometric Person Authentication, Guildford, UK, pp. 260-267, Jun. 2003
  14. X. Huang, A. Acero, and H.-W. Hon, “Spoken Language Processing,” Prentice-Hall, Upper Saddle River, NJ, 2001
  15. J. J. Ohala, “The temporal regulation of speech,” in Auditory Analysis and Perception, eds., G. Fant and M. A. Tatham, Academic Press, London, UK, pp. 431-453, 1975
  16. K. Munhall and E. Vatikiotis-Bateson, “The moving face during speech communication,” in Hearing by Eye II: Advances in the Psychology of Speechreading and Audio-Visual Speech, eds., R. Campbell, B. Dodd, and D. Burnham, Psychology Press, Hove, UK, pp. 123-142, 1998
  17. J. G. Proakis and D. G. Manolakis, “Digital Signal Processing,” Prentice-Hall, Upper Saddle River, NJ, 1996
  18. M. Vitkovitch and P. Barber, “Visible speech as a function of image quality: effects of display parameters on lipreading ability,” Applied Cognitive Psychology, Vol. 10, pp. 121-140, 1996 https://doi.org/10.1002/(SICI)1099-0720(199604)10:2<121::AID-ACP371>3.0.CO;2-V

Cited by

  1. Highly Reliable Fault Detection and Classification Algorithm for Induction Motors, Vol. 18B, No. 3, 2011, https://doi.org/10.3745/KIPSTB.2011.18B.3.147