Robust Feature Extraction for Voice Activity Detection in Nonstationary Noisy Environments

음성구간검출을 위한 비정상성 잡음에 강인한 특징 추출

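  • 홍정표 (Korea Advanced Institute of Science and Technology (KAIST), Department of Electrical and Electronic Engineering) ;
  • 박상준 (Korea Advanced Institute of Science and Technology (KAIST), Department of Electrical and Electronic Engineering) ;
  • 정상배 (Gyeongsang National University, Department of Electronic Engineering (Engineering Research Institute)) ;
  • 한민수 (Korea Advanced Institute of Science and Technology (KAIST), Department of Electrical and Electronic Engineering)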
  • Received : 2012.11.06
  • Accepted : 2013.03.13
  • Published : 2013.03.31


This paper proposes a robust feature extraction method for accurate voice activity detection (VAD). VAD is one of the principal modules in speech signal processing applications such as speech coding, speech enhancement, and speech recognition. In noisy environments, nonstationary noise drastically degrades VAD accuracy because the features fluctuate during noise-only intervals, which raises the false alarm rate. To improve VAD performance, this paper proposes harmonic-weighted energy. The feature focuses on voiced speech intervals and uses weighted harmonic-to-noise ratios to determine how much harmonicity is applied to the frame energy. Performance is evaluated using receiver operating characteristic (ROC) curves and the equal error rate (EER).
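The abstract does not give the formula for harmonic-weighted energy, so the following Python sketch only illustrates the general idea under stated assumptions: the harmonicity of each frame is approximated by the peak of the normalized autocorrelation within a plausible pitch-lag range (a simple stand-in for a harmonic-to-noise ratio), and that value scales the frame energy. The function names, the 25 ms / 10 ms framing, and the 60 to 400 Hz pitch range are all hypothetical choices, not taken from the paper. A small threshold-sweep estimate of the equal error rate is included as well.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def harmonicity(frame, fs, f0_min=60.0, f0_max=400.0):
    """Assumed harmonicity measure: peak of the normalized autocorrelation
    within the pitch-lag range, clipped to [0, 1]."""
    frame = frame - frame.mean()
    energy = np.dot(frame, frame)
    if energy <= 1e-12:
        return 0.0
    # Keep non-negative lags only (lag 0 is at index 0 after slicing).
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / f0_max)
    lag_max = min(int(fs / f0_min), len(frame) - 1)
    peak = ac[lag_min:lag_max + 1].max() / energy
    return float(np.clip(peak, 0.0, 1.0))

def harmonic_weighted_energy(x, fs, frame_ms=25, hop_ms=10):
    """Frame energy scaled by the assumed harmonicity weight."""
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    frames = frame_signal(np.asarray(x, dtype=float), frame_len, hop)
    energy = np.sum(frames ** 2, axis=1)
    weights = np.array([harmonicity(f, fs) for f in frames])
    return weights * energy

def equal_error_rate(scores, labels):
    """Sweep thresholds over the observed scores and return the error rate
    at the point where false-alarm and miss rates are closest."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        decisions = scores >= t
        far = np.mean(decisions[~labels])   # noise frames flagged as speech
        frr = np.mean(~decisions[labels])   # speech frames that were missed
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return float(eer)
```

With this sketch, a periodic signal and white noise of equal power yield clearly different feature values, because the noise frames receive a small harmonicity weight even though their raw energy is the same. That is the behavior the abstract attributes to the proposed feature; the actual weighting used in the paper may differ.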


