Voice Activity Detection using Motion and Variation of Intensity in the Mouth Region

  • Kim, Gi-Bak (School of Electrical Engineering, Soongsil University) ;
  • Ryu, Je-Woong (Department of Electrical Engineering and Computer Science, Seoul National University) ;
  • Cho, Nam-Ik (Department of Electrical Engineering and Computer Science, Seoul National University)
  • Received : 2012.03.28
  • Accepted : 2012.04.27
  • Published : 2012.05.30

Abstract

A common approach to voice activity detection (VAD) is to extract features from the acoustic signal and apply a decision rule. Its performance naturally degrades in noisy environments, however, and in such cases performance can be improved by using a video signal, either alone or together with the audio. Existing visual VAD methods typically rely on a single kind of visual feature, such as an active appearance model, optical flow, or intensity variation. Since the ground truth of voice activity is determined by the acoustic signal, a single kind of visual information has limited power to detect speech intervals. In this paper, we propose an algorithm that extracts features from two kinds of visual information in the mouth region, optical flow and intensity variation, and combines them to detect voice activity. Moreover, since a VAD algorithm used as a preprocessing stage for other systems should run with little computation, we compute a score from the extracted features with simple algebraic operations and compare it against a threshold, rather than relying on statistical modeling. To detect the mouth region, we propose an algorithm that first detects the eyes, the most distinctive feature points of the face, and then exploits the facial structure and intensity values. Experimental results confirm that the proposed algorithm combining the two features outperforms those using only a single feature.
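The eye-guided mouth localization described in the abstract can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's exact procedure: the search-window fractions of the inter-eye distance and the band width are assumed values chosen for the sketch, and the lip line is located simply as the darkest row of a vertical intensity profile below the eyes.

```python
import numpy as np

def locate_mouth_center(gray, left_eye, right_eye):
    """Estimate the mouth center from detected eye positions.

    gray: 2-D array of grayscale intensities.
    left_eye, right_eye: (row, col) coordinates of the eye centers.
    The mouth is assumed to lie below the eye midpoint at roughly one
    inter-eye distance; the row is refined by searching the vertical
    intensity profile for its darkest point (the lip line is dark).
    Window fractions (0.6-1.4, 0.3) are illustrative assumptions.
    """
    (r1, c1), (r2, c2) = left_eye, right_eye
    eye_dist = np.hypot(r2 - r1, c2 - c1)
    mid_r, mid_c = (r1 + r2) / 2.0, (c1 + c2) / 2.0

    # Search window: from ~0.6 to ~1.4 inter-eye distances below the eyes.
    top = int(mid_r + 0.6 * eye_dist)
    bot = min(int(mid_r + 1.4 * eye_dist), gray.shape[0] - 1)
    col = int(mid_c)

    # Average intensity across a narrow horizontal band for each row.
    half_w = max(int(0.3 * eye_dist), 1)
    band = gray[top:bot + 1, max(col - half_w, 0):col + half_w + 1]
    profile = band.mean(axis=1)

    mouth_row = top + int(np.argmin(profile))  # darkest row = lip line
    return mouth_row, col
```

For example, on a synthetic 100x100 face image with a dark horizontal band at row 60 and eyes at (20, 30) and (20, 70), the function returns (60, 50).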

Voice activity detection (VAD) is generally conducted by extracting features from the acoustic signal and applying a decision rule. The performance of such acoustically driven VAD algorithms depends heavily on the level of acoustic noise. When a video signal is also available, VAD performance can be enhanced by exploiting visual information, which is unaffected by acoustic noise. Previous visual VAD algorithms usually use a single visual feature to detect lip activity, such as active appearance models, optical flow, or intensity variation. Based on an analysis of the weakness of each feature, we propose to combine an intensity-change measure with the optical flow in the mouth region, so that the two can compensate for each other's weaknesses. To minimize computational complexity, we develop simple measures that avoid statistical estimation or modeling: specifically, the motion measure is the motion vector averaged over grid regions, and the intensity variation is detected by simple thresholding. To extract the mouth region, we propose a simple algorithm that first detects the two eyes and then uses the intensity profile to locate the center of the mouth. Experiments show that the proposed combination of two simple measures yields higher detection rates at a given false-positive rate than methods that use a single feature.
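The algebraic score combination described above can be sketched as follows. This is a toy stand-in, not the paper's formula: frame differencing replaces true optical-flow estimation to keep the sketch dependency-free, and the grid size, difference threshold, and weight `alpha` are all assumed values.

```python
import numpy as np

def vad_score(prev, curr, grid=4, diff_thresh=15.0, alpha=0.5):
    """Toy frame-level visual VAD score (illustrative only).

    prev, curr: consecutive grayscale mouth-region frames (2-D arrays).
    Motion cue: mean absolute frame difference per grid cell, averaged
    over cells, standing in for the averaged optical-flow magnitude.
    Intensity cue: fraction of pixels whose change exceeds diff_thresh.
    The two cues are combined by a simple weighted sum; the result can
    then be compared against a threshold to declare speech activity.
    """
    d = np.abs(curr.astype(float) - prev.astype(float))
    h, w = d.shape
    gh, gw = h // grid, w // grid
    # Average the change within each grid cell, then over all cells.
    cells = d[:gh * grid, :gw * grid].reshape(grid, gh, grid, gw)
    motion = cells.mean(axis=(1, 3)).mean()
    intensity = (d > diff_thresh).mean()
    return alpha * motion + (1.0 - alpha) * intensity
```

With identical frames the score is zero; any mouth movement raises both cues, so a single threshold on the score yields a cheap per-frame speech/non-speech decision, matching the paper's goal of avoiding statistical modeling in the front end.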

References

  1. J. Sohn, N. S. Kim and W. Sung, "A statistical model-based voice activity detection," IEEE Signal Processing Letters, vol. 6, no. 1, pp. 1-3, January 1999.
  2. M. Hoffman, Z. Li, and D. Khataniar, "GSC-based spatial voice activity detection for enhanced speech coding in the presence of competing speech," IEEE Trans. on Speech and Audio Processing, vol. 9, no. 2, pp. 175-179, March 2001. https://doi.org/10.1109/89.902284
  3. S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoust., Speech, and Signal Processing, vol. ASSP-27, no. 2, pp. 113-120, April, 1979.
  4. L. F. Lamel, L. R. Rabiner, A. E. Rosenberg, and J. G. Wilpon, "An improved endpoint detector for isolated word recognition," IEEE Trans. Acoust., Speech, and Signal Processing, vol. ASSP-29, no. 4, pp. 777-785, August 1981.
  5. B.-F. Wu, "Robust endpoint detection algorithm based on the adaptive band-partitioning spectral entropy in adverse environments," IEEE Trans. Speech and Audio Processing, vol. 13, no. 5, pp. 762-775, September 2005. https://doi.org/10.1109/TSA.2005.851909
  6. L. Armani, M. Matassoni, M. Omologo, and P. Svaizer, "Use of a CSP-based voice activity detector for distant-talking ASR," in Proceedings of EUROSPEECH, Geneva, 2003.
  7. G. Kim and N. I. Cho, "Voice activity detection using phase vector in microphone array," Electronics Letters, vol. 43, issue 14, pp. 783-784, July 2007. https://doi.org/10.1049/el:20070780
  8. H. Yehia, R. Rubin, and E. Vatikiotis-Bateson, "Quantitative association of vocal-tract and facial behavior," Speech Communication, vol. 26, no. 1, pp. 23-43, August 1998. https://doi.org/10.1016/S0167-6393(98)00048-X
  9. P. Liu and Z. Wang, "Voice activity detection using visual information," in Proceedings of ICASSP, pp. 609-612, Montreal, Canada, May 2004.
  10. T. Cootes, G. Edwards, and C. Taylor, "Active appearance models," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 681-685, June 2001. https://doi.org/10.1109/34.927467
  11. A. Aubrey, B. Rivet, Y. Hicks, L. Girin, L. Chambers, and C. Jutten, "Two novel visual voice activity detectors based on appearance models and retinal filtering," in Proceedings of EUSIPCO, September 2007.
  12. S. Siatras, N. Nikolaidis, M. Krinidis, and I. Pitas, "Visual lip activity detection and speaker detection using mouth region intensities," IEEE Trans. Circuits and Systems for Video Technology, vol. 19, no. 1, pp. 133-137, January 2009. https://doi.org/10.1109/TCSVT.2008.2009262
  13. R. Navarathna, D. Dean, P. Lucey, S. Sridharan, and C. Fookes, "Cascading appearance-based features for visual voice activity detection," in Proceedings of International Conference on Audio-Visual Speech Processing, Hakone, Japan, September 2010.
  14. A. Aubrey, Y. Hicks, and J. Chambers, "Visual voice activity detection with optical flow," IET Image Processing, vol. 4, no. 4, pp. 463-472, December 2010. https://doi.org/10.1049/iet-ipr.2009.0042
  15. S. Tamura, K. Iwano, and S. Furui, "Multi-modal speech recognition using optical-flow analysis for lip images," J. VLSI Signal Process. Syst., vol. 36, pp. 117-124, February 2004. https://doi.org/10.1023/B:VLSI.0000015091.47302.07
  16. D. Sun, S. Roth, and M. Black, "Secrets of optical flow estimation and their principles," in Proceedings of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 2432-2439, San Francisco, USA, June 2010.
  17. B. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI), pp. 674-679, April 1981.
  18. R. Navarathna, D. Dean, S. Sridharan, C. Fookes, and P. Lucey, "Visual voice activity detection using frontal versus profile views," in Proceedings of the International Conference on Digital Image Computing: Techniques and Applications, December 2011.
  19. P. Viola and M. Jones, "Robust Real-time Object Detection", Second International Workshop on Statistical and Computational Theories of Vision-Modeling, Learning, Computing, and Sampling, Vancouver, Canada, July 2001.
  20. Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in Computational Learning Theory: EuroCOLT, Springer-Verlag, pp. 23-37, 1995.
  21. E. Skodras and N. Fakotakis, "An unconstrained method for lip detection in color images," in Proceedings of ICASSP, Prague, Czech Republic, 2011.
  22. G. Fanelli, J. Gall, and L. Van Gool, "Hough transform-based mouth localization for audio-visual speech recognition," British Machine Vision Conference, 2009.
  23. X. Liu, Y. Cheung, M. Li, and H. Liu, "A lip contour extraction method using localized active contour model with automatic parameter selection," 20th Int. Conf. on Pattern Recognition (ICPR), August 2010.

Cited by

  1. Visual Voice Activity Detection and Adaptive Threshold Estimation for Speech Recognition vol.34, pp.4, 2015, https://doi.org/10.7776/ASK.2015.34.4.321