http://dx.doi.org/10.4218/etrij.11.1510.0158

A Weighted Feature Voting Approach for Robust and Real-Time Voice Activity Detection  

Moattar, Mohammad Hossein (Department of Computer Engineering and IT, Amirkabir University of Technology)
Homayounpour, Mohammad Mehdi (Department of Computer Engineering and IT, Amirkabir University of Technology)
Publication Information
ETRI Journal, vol. 33, no. 1, 2011, pp. 99-109
Abstract
This paper presents a robust, real-time voice activity detection (VAD) approach that is easy to understand and implement. The proposed approach combines several short-term features that discriminate speech from nonspeech in a voting paradigm, in order to achieve reliable performance across different environments. The paper focuses mainly on improving a recently proposed approach that uses the spectral peak valley difference (SPVD) as a feature for silence detection; its main contribution is to combine SPVD with a set of additional features to make the VAD more robust. A weighted voting scheme is used so that the discriminative power of each feature in the set is taken into account. The experiments show that the proposed approach is more robust than the baseline approach in several respects, including channel distortion and threshold selection. The proposed approach is also compared with several other VAD techniques to further confirm these gains. Using the proposed weighted voting approach, the average VAD performance rises to 89.29% over 5 noise types and 8 SNR levels. This is 13.79% higher than the approach based only on SPVD and 2.25% higher than the unweighted voting scheme.
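The abstract describes the core idea only at a high level: each short-term feature makes its own speech/nonspeech decision for a frame, and those decisions are combined by a weighted vote. The Python sketch below illustrates that idea under stated assumptions; it is not the authors' implementation. SPVD and pitch are not implemented here. Spectral flatness (listed in the keywords) is kept, while short-term log energy and zero-crossing rate are hypothetical stand-ins. All thresholds, weights, the 256-sample frame, and the function names are illustrative assumptions.

```python
import cmath
import math
import random
from typing import Dict, List, Sequence

# Hypothetical thresholds and weights, chosen only for illustration; the
# paper uses its own feature set, thresholds and weights.
THRESHOLDS = {"energy_db": -30.0, "flatness": 0.3, "zcr": 0.25}
WEIGHTS = {"energy": 0.6, "flatness": 0.25, "zcr": 0.15}


def power_spectrum(frame: Sequence[float]) -> List[float]:
    """Naive O(N^2) DFT power for bins 0..N/2; adequate for a short demo frame."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))) ** 2
        for k in range(n // 2 + 1)
    ]


def log_energy_db(frame: Sequence[float]) -> float:
    """Mean-square frame energy in dB (a small floor avoids log10(0))."""
    return 10.0 * math.log10(sum(x * x for x in frame) / len(frame) + 1e-12)


def spectral_flatness(frame: Sequence[float]) -> float:
    """Geometric mean / arithmetic mean of the power spectrum (DC bin excluded).
    Near 1 for white noise, near 0 for strongly tonal (voiced-like) frames."""
    power = [p + 1e-12 for p in power_spectrum(frame)[1:]]
    geometric = math.exp(sum(math.log(p) for p in power) / len(power))
    arithmetic = sum(power) / len(power)
    return geometric / arithmetic


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum((a >= 0) != (b >= 0) for a, b in zip(frame, frame[1:]))
    return crossings / (len(frame) - 1)


def weighted_vote(votes: Dict[str, bool], weights: Dict[str, float]) -> bool:
    """Speech if the summed weight of 'speech' votes exceeds half the total weight."""
    speech_weight = sum(weights[name] for name, is_speech in votes.items() if is_speech)
    return speech_weight > sum(weights.values()) / 2.0


def is_speech_frame(frame: Sequence[float]) -> bool:
    """Each feature casts one binary vote; the votes are then combined by weight."""
    votes = {
        "energy": log_energy_db(frame) > THRESHOLDS["energy_db"],
        "flatness": spectral_flatness(frame) < THRESHOLDS["flatness"],
        "zcr": zero_crossing_rate(frame) < THRESHOLDS["zcr"],
    }
    return weighted_vote(votes, WEIGHTS)


if __name__ == "__main__":
    n = 256  # e.g. 32 ms at 8 kHz
    tone = [0.5 * math.sin(2 * math.pi * 8 * t / n) for t in range(n)]  # 250 Hz at 8 kHz
    rng = random.Random(0)
    hiss = [rng.uniform(-1e-3, 1e-3) for _ in range(n)]  # low-level noise frame
    print(is_speech_frame(tone), is_speech_frame(hiss))
```

With these made-up weights, the energy vote alone outweighs the other two votes combined, so two features voting "speech" can still be overruled; this is what distinguishes weighted voting from a plain majority vote. According to the abstract, the paper instead sets the weights to reflect each feature's discriminative power. A practical VAD would typically also use an FFT and adaptive thresholds rather than the fixed values assumed here.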
Keywords
Spectral peak valley difference; spectral flatness; pitch; noise robustness; weighted voting; voice activity detection
Citations & Related Records
Times Cited By KSCI: 1
Times Cited By Web of Science: 2
Times Cited By Scopus: 2