Visual Voice Activity Detection and Adaptive Threshold Estimation for Speech Recognition

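음성인식기 성능 향상을 위한 영상기반 음성구간 검출 및 적응적 문턱값 추정 (Video-Based Voice Activity Detection and Adaptive Threshold Estimation for Improving Speech Recognizer Performance)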

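  • 송태엽 (Interdisciplinary Program in Bio-Microsystem Technology, Korea University)
  • 이경선 (School of Electrical Engineering, Korea University)
  • 김성수 (Samsung Electronics)
  • 이재원 (Samsung Electronics)
  • 고한석 (School of Electrical Engineering, Korea University)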
  • Received : 2015.04.16
  • Accepted : 2015.05.28
  • Published : 2015.07.31

Abstract

In this paper, we propose an algorithm for robust Visual Voice Activity Detection (VVAD) to enhance speech recognition. Conventional VVAD algorithms detect visual speech frames by measuring the motion of the lip region with optical flow or chaos-inspired measures. Optical-flow-based VVAD is difficult to adopt in driving scenarios because of its computational complexity. Chaos-theory-based VVAD, while invariant to illumination changes, is sensitive to the translational motion caused by the driver's head movements. The proposed Local Variance Histogram (LVH) is robust to pixel intensity changes caused by both illumination and translation. In addition, to improve performance under changing environments, we adopt a novel threshold estimation method based on the change in total variance. The experimental results show that the proposed VVAD algorithm is robust in various driving situations.

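In this study, we propose a visual voice activity detection method for improving speech recognizer performance. Conventional optical-flow-based methods cannot cope with illumination changes and are computationally expensive, which makes them difficult to apply to smart devices on mobile platforms. Chaos-theory-based methods are robust to illumination changes but suffer from false detections caused by vehicle motion and inaccurate lip detection. To address these problems of existing visual voice activity detection algorithms, we propose a detection algorithm that uses a Local Variance Histogram (LVH) together with an adaptive threshold estimation method. The proposed method is robust to pixel changes caused by illumination variation, is computationally fast, and, by using an adaptive threshold, robustly detects the speech of a driver whose video contains large illumination changes and motion. Performance measured on videos of drivers recorded in moving vehicles confirms that the proposed method outperforms the existing methods.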
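
The abstract does not define the LVH or the threshold estimator, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes integer grey-level mouth-region frames given as nested lists, a hypothetical 3 x 3 window, an L1 histogram distance, and a running-mean threshold that stands in for the paper's estimator based on total variance change. All function names and parameters (`variance_histogram`, `detect_speech`, `scale`, `floor`) are illustrative.

```python
def local_variance(patch):
    """Population variance of integer pixels, computed with exact integer sums."""
    n = len(patch)
    s = sum(patch)
    sq = sum(p * p for p in patch)
    return (n * sq - s * s) / (n * n)


def variance_histogram(frame, k=3, bins=8, vmax=1024.0):
    """Normalised histogram of k x k local variances (values clipped at vmax)."""
    h, w = len(frame), len(frame[0])
    hist = [0] * bins
    count = 0
    for y in range(h - k + 1):
        for x in range(w - k + 1):
            patch = [frame[y + dy][x + dx] for dy in range(k) for dx in range(k)]
            idx = min(int(local_variance(patch) / vmax * bins), bins - 1)
            hist[idx] += 1
            count += 1
    return [c / count for c in hist]


def histogram_distance(h1, h2):
    """L1 distance between two normalised histograms, in [0, 2]."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def detect_speech(frames, scale=1.5, floor=0.05):
    """Flag each frame transition whose histogram change exceeds an adaptive threshold.

    Assumption: the threshold is scale * (mean of earlier changes), never below
    floor. This is a placeholder for the paper's threshold estimator.
    """
    flags, history = [], []
    prev = variance_histogram(frames[0])
    for frame in frames[1:]:
        cur = variance_histogram(frame)
        d = histogram_distance(prev, cur)
        mean_past = sum(history) / len(history) if history else 0.0
        flags.append(d > max(floor, scale * mean_past))
        history.append(d)
        prev = cur
    return flags


# Toy 4 x 4 frames: a flat "closed mouth" and a textured "open mouth".
still = [[10] * 4 for _ in range(4)]
moving = [[10, 10, 10, 10], [10, 50, 90, 10], [10, 10, 10, 10], [10, 10, 10, 10]]
brighter = [[p + 40 for p in row] for row in moving]

print(variance_histogram(moving) == variance_histogram(brighter))
print(detect_speech([still, still, still, moving]))
```

Because variance is unchanged when a constant is added to every pixel, the histogram of the uniformly brightened frame matches the original, which illustrates why a variance-based feature tolerates global illumination shifts. Real illumination changes are rarely uniform, so this property holds only approximately in practice.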
