• Title/Summary/Keyword: Speech/Non-speech Detection


A Low Bit Rate Speech Coder Based on the Inflection Point Detection

  • Iem, Byeong-Gwan
    • International Journal of Fuzzy Logic and Intelligent Systems / v.15 no.4 / pp.300-304 / 2015
  • A low bit rate speech coder based on a non-uniform sampling technique is proposed. The non-uniform sampling is based on the detection of inflection points (IPs). A speech block is processed by the IP detector, and the detected IP pattern is compared with the entries of an IP database. The address of the closest database entry is transmitted together with the energy of the speech block. In the receiver, the decoder reconstructs the speech block from the received address and the energy information of the block. As a result, the coder has a fixed data rate, unlike existing speech coders based on non-uniform sampling. Computer simulation demonstrates the usefulness of the proposed technique. The SNR of the proposed method is approximately 5.27 dB at a data rate of 1.5 kbps.
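
The abstract does not say how inflection points are detected. A common discrete approximation is to look for a sign change in the second difference of the samples, and the sketch below does that under this assumption. The function name `inflection_points` is hypothetical, and the paper's IP detector may differ.

```python
def inflection_points(block):
    """Return indices at which the discrete curvature first takes a new sign.

    Assumption: an inflection point is a sign change of the second
    difference block[i-1] - 2*block[i] + block[i+1]. Zero curvature
    is skipped, so flat or linear stretches produce no detections.
    """
    points = []
    prev_sign = 0  # sign of the last non-zero second difference seen
    for i in range(1, len(block) - 1):
        d2 = block[i - 1] - 2 * block[i] + block[i + 1]
        sign = (d2 > 0) - (d2 < 0)
        if sign != 0:
            if prev_sign != 0 and sign != prev_sign:
                points.append(i)
            prev_sign = sign
    return points


# A sampled cubic changes from concave to convex around its middle sample.
cubic = [x ** 3 for x in range(-3, 4)]
print(inflection_points(cubic))
```

In a coder of the kind described, the resulting index pattern would then be matched against database entries; that matching step is not sketched here because the abstract gives no distance measure.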

A Simple Speech/Non-speech Classifier Using Adaptive Boosting

  • Kwon, Oh-Wook;Lee, Te-Won
    • The Journal of the Acoustical Society of Korea / v.22 no.3E / pp.124-132 / 2003
  • We propose a new speech/non-speech classification method based on the adaptive boosting (AdaBoost) algorithm, in order to detect speech for robust speech recognition. The method combines simple base classifiers through the AdaBoost algorithm and uses a set of optimized speech features together with spectral subtraction. The key benefits of this method are simple implementation, low computational complexity, and avoidance of the over-fitting problem. We checked the validity of the method by comparing its performance with the speech/non-speech classifier used in a standard voice activity detector. For speech recognition purposes, additional performance improvements were achieved by adopting new features, including speech band energies and MFCC-based spectral distortion. For the same false alarm rate, the method reduced miss errors by 20-50%.
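
To illustrate how AdaBoost combines simple base classifiers, here is a minimal sketch that uses one-feature threshold classifiers (decision stumps) as the base learners. The stump form, the function names `train_adaboost` and `predict`, and the toy feature values are assumptions for illustration; the abstract does not specify the paper's base classifiers or features.

```python
import math


def train_adaboost(X, y, rounds=5):
    """Train AdaBoost with decision stumps.

    X: list of feature vectors, y: labels in {-1, +1} (+1 = speech here).
    Returns a list of (alpha, feature_index, threshold, polarity) stumps.
    """
    n = len(X)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        best = None
        # Exhaustively pick the stump with the lowest weighted error.
        for f in range(len(X[0])):
            for thr in sorted({x[f] for x in X}):
                for pol in (1, -1):
                    preds = [pol if x[f] >= thr else -pol for x in X]
                    err = sum(w for w, p, t in zip(weights, preds, y) if p != t)
                    if best is None or err < best[0]:
                        best = (err, f, thr, pol, preds)
        err, f, thr, pol, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by 0
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, f, thr, pol))
        # Increase the weight of misclassified frames, then renormalize.
        weights = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def predict(model, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * (pol if x[f] >= thr else -pol) for a, f, thr, pol in model)
    return 1 if score >= 0 else -1


# Toy one-dimensional "band energy" values (made up for illustration).
frames = [[0.1], [0.2], [0.8], [0.9]]
labels = [-1, -1, 1, 1]
model = train_adaboost(frames, labels, rounds=3)
print(predict(model, [0.85]), predict(model, [0.15]))
```

Each stump is cheap to evaluate, which is consistent with the low computational complexity the abstract emphasizes.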

Voice Activity Detection Based on SNR and Non-Intrusive Speech Intelligibility Estimation

  • An, Soo Jeong;Choi, Seung Ho
    • International Journal of Internet, Broadcasting and Communication / v.11 no.4 / pp.26-30 / 2019
  • This paper proposes a new voice activity detection (VAD) method based on SNR and non-intrusive speech intelligibility estimation. In conventional SNR-based VAD methods, the voice activity probability is obtained by estimating the frame-wise SNR at each spectral component. However, these methods perform poorly in various noisy environments. We devise a hybrid VAD method that uses non-intrusive speech intelligibility estimation as well as SNR estimation, where the speech intelligibility score is estimated by a deep neural network. To train the parameters of the deep neural network, we use the MFCC vector as input and the intrusive speech intelligibility score, STOI (Short-Time Objective Intelligibility), as output. We developed a speech presence measure that classifies each noisy frame as voice or non-voice by calculating the weighted average of the estimated STOI value and the conventional SNR-based VAD value at each frame. Experimental results show that the proposed method performs better than the conventional VAD method in various noisy environments, especially when the SNR is very low.
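
The weighted-average decision described above can be sketched as follows. The weight of 0.5, the threshold of 0.5, and the function name `speech_presence` are assumptions for illustration; the abstract states neither the weighting nor the decision threshold, and the per-frame inputs are taken as already computed.

```python
def speech_presence(stoi_scores, snr_vad_probs, weight=0.5, threshold=0.5):
    """Combine two per-frame scores into a voice/non-voice decision.

    stoi_scores:   estimated STOI value per frame, assumed to lie in [0, 1]
    snr_vad_probs: SNR-based voice activity value per frame, in [0, 1]
    weight:        hypothetical weight on the STOI term
    threshold:     hypothetical decision threshold on the combined score
    Returns a list of (combined_score, is_voice) tuples.
    """
    decisions = []
    for stoi, vad in zip(stoi_scores, snr_vad_probs):
        score = weight * stoi + (1.0 - weight) * vad
        decisions.append((score, score >= threshold))
    return decisions


# Two made-up frames: one clearly voiced, one clearly not.
for score, is_voice in speech_presence([0.9, 0.1], [0.8, 0.2]):
    print(round(score, 2), is_voice)
```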

A Fixed Rate Speech Coder Based on the Filter Bank Method and the Inflection Point Detection

  • Iem, Byeong-Gwan
    • International Journal of Fuzzy Logic and Intelligent Systems / v.16 no.4 / pp.276-280 / 2016
  • A fixed rate speech coder based on a filter bank and a non-uniform sampling technique is proposed. The non-uniform sampling is achieved by detecting inflection points (IPs). A speech block is band-pass filtered by the filter bank, the subband signals are processed by the IP detector, and the detected IP patterns are compared with the entries of an IP database. For each subband signal, the address of the closest database entry and the energy of the IP pattern are transmitted over the channel. In the receiver, the decoder recovers the subband signals from the received addresses and energy information, and reconstructs the speech via filter bank summation. As a result, the coder has a fixed data rate, unlike existing speech coders based on non-uniform sampling. Computer simulation confirms the usefulness of the proposed technique. The signal-to-noise ratio (SNR) of the proposed method is comparable to that of uniformly sampled pulse code modulation (PCM) at data rates below 20 kbps.

Robust Voice Activity Detection in Noisy Environment Using Entropy and Harmonics Detection (엔트로피와 하모닉 검출을 이용한 잡음환경에 강인한 음성검출)

  • Choi, Gab-Keun;Kim, Soon-Hyob
    • Journal of the Institute of Electronics Engineers of Korea SP / v.47 no.1 / pp.169-174 / 2010
  • This paper presents an end-point detection method for improving speech recognition rates. The proposed method separates speech and non-speech regions using entropy and harmonic detection. End-point detection using the entropy of the speech spectral energy performs well in high-SNR environments (SNR 15 dB). In low-SNR environments (SNR 0 dB), however, the threshold level separating speech from noise varies, so precise end-point detection is difficult. Therefore, this paper introduces an end-point detection method that uses both speech spectral entropy and harmonics. Experiments show better performance than conventional entropy methods.
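
For readers unfamiliar with spectral entropy, the sketch below computes it in the usual way: the power spectrum of a frame is normalized into a probability distribution and its Shannon entropy is taken. The function name `spectral_entropy` and the toy spectra are assumptions; the paper's exact feature and its harmonic detector are not described in the abstract.

```python
import math


def spectral_entropy(power_spectrum):
    """Shannon entropy (in nats) of a normalized power spectrum.

    A flat spectrum gives the maximum value log(N); a spectrum
    concentrated in a single bin gives 0.
    """
    total = sum(power_spectrum)
    if total <= 0:
        return 0.0  # silent frame: define the entropy as 0
    entropy = 0.0
    for p in power_spectrum:
        q = p / total
        if q > 0:
            entropy -= q * math.log(q)
    return entropy


flat = [1.0, 1.0, 1.0, 1.0]    # noise-like toy spectrum
peaky = [8.0, 0.5, 0.5, 0.1]   # energy concentrated in one bin
print(spectral_entropy(flat) > spectral_entropy(peaky))
```

A detector would compare this value against a threshold per frame, which is the step the abstract reports as unreliable at low SNR.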

Weighted Finite State Transducer-Based Endpoint Detection Using Probabilistic Decision Logic

  • Chung, Hoon;Lee, Sung Joo;Lee, Yun Keun
    • ETRI Journal / v.36 no.5 / pp.714-720 / 2014
  • In this paper, we propose the use of data-driven probabilistic utterance-level decision logic to improve Weighted Finite State Transducer (WFST)-based endpoint detection. In general, endpoint detection is dealt with using two cascaded decision processes. The first process is frame-level speech/non-speech classification based on statistical hypothesis testing, and the second process is a heuristic-knowledge-based utterance-level speech boundary decision. To handle these two processes within a unified framework, we propose a WFST-based approach. However, a WFST-based approach has the same limitations as conventional approaches in that the utterance-level decision is based on heuristic knowledge and the decision parameters are tuned sequentially. Therefore, to obtain decision knowledge from a speech corpus and optimize the parameters at the same time, we propose the use of data-driven probabilistic utterance-level decision logic. The proposed method reduces the average detection failure rate by about 14% for various noisy-speech corpora collected for an endpoint detection evaluation.

Detection and Recognition Method for Emergency and Non-emergency Speech by Gaussian Mixture Model (GMM을 이용한 응급 단어와 비응급 단어의 검출 및 인식 기법)

  • Cho, Young-Im;Lee, Dae-Jong
    • Journal of the Korean Institute of Intelligent Systems / v.21 no.2 / pp.254-259 / 2011
  • Monitoring for emergencies in everyday CCTV environments using images alone causes problems, particularly in cost and manpower. Therefore, to detect emergency states dynamically through CCTV and to resolve these problems, we propose a detection and recognition method for emergency and non-emergency speech using Gaussian mixture models (GMMs). The proposed method first determines whether the input speech is emergency or non-emergency speech using a global GMM. If it is emergency speech, a local GMM is then applied to classify the type of emergency speech. The proposed method is tested and verified on emergency and non-emergency speech in various environmental conditions.

Voice Activity Detection Based on Entropy in Noisy Car Environment (차량 잡음 환경에서 엔트로피 기반의 음성 구간 검출)

  • Roh, Yong-Wan;Lee, Kue-Bum;Lee, Woo-Seok;Hong, Kwang-Seok
    • Journal of the Institute of Convergence Signal Processing / v.9 no.2 / pp.121-128 / 2008
  • Accurate voice activity detection (VAD) has a great impact on the performance of speech applications, including speech recognition, speech coding, and speech communication. In this paper, we propose VAD methods that can adapt to various car noise situations during driving. Existing VAD approaches use measures such as time-domain energy, frequency-domain energy, zero crossing rate, and spectral entropy, whose performance declines rapidly in noisy environments. Building on the existing spectral entropy approach, we propose VAD methods using MFB (Mel-frequency filter bank) spectral entropy, gradient FFT (Fast Fourier Transform) spectral entropy, and gradient MFB spectral entropy. The MFB is obtained by applying the Mel scale to the FFT spectrum; the Mel scale is a nonlinear scale that reflects human sound perception. The proposed MFB spectral entropy method clearly improves the ability to discriminate between speech and non-speech in various noisy car environments, achieving 93.21% accuracy in experiments. Compared to the spectral entropy method, the proposed VAD gives an average improvement in the correct detection rate of more than 3.2%.
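
To make the nonlinearity of the Mel scale concrete, the sketch below spaces filter center frequencies evenly in Mel and converts them back to Hz. It uses one commonly cited Mel formula, mel = 2595 log10(1 + f/700). The formula choice, the function names, and the filter count are assumptions; the abstract does not state which Mel variant or filter bank layout the paper uses.

```python
import math


def hz_to_mel(f_hz):
    """Convert Hz to Mel using a commonly used formula (an assumption here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_min_hz, f_max_hz):
    """Center frequencies (Hz) of n_filters bands equally spaced in Mel."""
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (k + 1)) for k in range(n_filters)]


# Hypothetical layout: 5 filters between 0 Hz and 4000 Hz.
centers = mel_center_frequencies(5, 0.0, 4000.0)
print([round(c) for c in centers])
```

The centers are packed more densely at low frequencies than at high ones. An MFB spectral entropy would then be computed over the energies of filters placed at these centers rather than over the raw FFT bins.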


Comparison Research of Non-Target Sentence Rejection on Phoneme-Based Recognition Networks (음소기반 인식 네트워크에서의 비인식 대상 문장 거부 기능의 비교 연구)

  • Kim, Hyung-Tai;Ha, Jin-Young
    • MALSORI / no.59 / pp.27-51 / 2006
  • For speech recognition systems, a rejection function as well as a decoding function is necessary to improve reliability. There have been many research efforts on out-of-vocabulary word rejection; however, little attention has been paid to non-target sentence rejection. Recently, pronunciation approaches using speech recognition have increased the need for non-target sentence rejection to provide more accurate and robust results. In this paper, we propose a filler model method and a word/phoneme detection ratio method for implementing a non-target sentence rejection system. We evaluated the filler model method with word-level, phoneme-level, and sentence-level filler models. We also performed similar experiments using word-level and phoneme-level detection ratio methods. For the performance evaluation, the minimized average of FAR (false acceptance rate) and FRR (false rejection rate) is used to compare the effectiveness of each method as a function of the number of words in the given sentences. The experimental results show that the word-level methods outperform the others, and that the word-level filler model gives slightly better results than the word detection ratio method.
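
The evaluation metric, the minimized average of FAR and FRR, can be read as sweeping the acceptance threshold and keeping the threshold with the lowest average error. The sketch below follows that reading. The function name, the score convention (higher means more likely a target sentence), and the toy scores are assumptions, since the abstract does not define how the minimum is searched.

```python
def min_average_far_frr(target_scores, nontarget_scores):
    """Sweep thresholds and return (best_average, best_threshold).

    A sentence is accepted when its score >= threshold.
    FAR = fraction of non-target sentences accepted.
    FRR = fraction of target sentences rejected.
    Candidate thresholds are the observed scores, tried in ascending order;
    the lowest threshold reaching the minimum average is returned.
    """
    best_avg, best_thr = None, None
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        avg = (far + frr) / 2.0
        if best_avg is None or avg < best_avg:
            best_avg, best_thr = avg, thr
    return best_avg, best_thr


# Made-up confidence scores for target and non-target sentences.
print(min_average_far_frr([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))
```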


DNN based Speech Detection for the Media Audio (미디어 오디오에서의 DNN 기반 음성 검출)

  • Jang, Inseon;Ahn, ChungHyun;Seo, Jeongil;Jang, Younseon
    • Journal of Broadcast Engineering / v.22 no.5 / pp.632-642 / 2017
  • In this paper, we propose a DNN-based speech detection system that uses the acoustic characteristics and context information of media audio. Speech detection, which discriminates between speech and non-speech in media audio, is a necessary preprocessing step for effective speech processing. However, since media audio signals include various types of sound sources, it has been difficult to achieve high performance with conventional signal processing techniques. The proposed method improves speech detection performance by separating the harmonic and percussive components of the media audio and constructing a DNN input vector that reflects its acoustic characteristics and context information. To verify the performance of the proposed system, a data set for speech detection was built from more than 20 hours of drama, and a publicly available 8-hour Hollywood movie data set was additionally acquired for the experiments. Cross validation on the two data sets shows that the proposed system outperforms the conventional method.