• Title/Summary/Keyword: small-data speech recognition

Selecting Good Speech Features for Recognition

  • Lee, Young-Jik; Hwang, Kyu-Woong
    • ETRI Journal, v.18 no.1, pp.29-41, 1996
  • This paper describes a method to select suitable features for speech recognition using an information-theoretic measure. Conventional speech recognition systems heuristically choose a portion of frequency components, cepstrum, mel-cepstrum, energy, and their time differences of speech waveforms as their speech features. However, these systems cannot perform well if the selected features are not suitable for speech recognition. Since the recognition rate is the only performance measure of a speech recognition system, it is hard to judge how suitable a selected feature is. To solve this problem, it is essential to analyze the feature itself and measure how good the feature is on its own. Good speech features should contain all of the class-related information and as little class-irrelevant variation as possible. In this paper, we suggest a method to measure the class-related information and the amount of class-irrelevant variation based on Shannon's information theory. Using this method, we compare the mel-scaled FFT, cepstrum, mel-cepstrum, and wavelet features of the TIMIT speech data. The results show that, among these features, the mel-scaled FFT is the best feature for speech recognition according to the proposed measure.
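
To make the idea concrete, a rough Python sketch of an information-theoretic feature-quality measure follows. It estimates the mutual information I(F;C) between a quantized scalar feature and class (e.g., phone) labels via histograms; this illustrates the general Shannon-theoretic approach rather than the paper's exact measure, and the demo data is synthetic.

```python
# Sketch: estimating how much class (phone) information a speech feature
# carries, via mutual information over a histogram quantization. This
# illustrates the general information-theoretic idea, not the paper's
# exact measure; `features` and `labels` here are synthetic.
import numpy as np

def mutual_information(features, labels, n_bins=32):
    """I(F;C) in bits for a scalar feature and integer class labels."""
    edges = np.histogram_bin_edges(features, bins=n_bins)
    f = np.digitize(features, edges[1:-1])          # quantized feature, 0..n_bins-1
    classes = np.unique(labels)
    joint = np.zeros((n_bins, len(classes)))
    for j, c in enumerate(classes):
        joint[:, j] = np.bincount(f[labels == c], minlength=n_bins)
    joint /= joint.sum()                            # joint distribution P(F, C)
    pf = joint.sum(axis=1, keepdims=True)           # marginal P(F)
    pc = joint.sum(axis=0, keepdims=True)           # marginal P(C)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (pf @ pc)[nz])).sum())

# A feature with higher I(F;C) retains more class-related information;
# comparing it across mel-FFT, cepstrum, etc. would rank candidate features.
rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=2000)
features = labels + 0.5 * rng.standard_normal(2000)  # class-informative feature
print(mutual_information(features, labels))
```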
End-to-end speech recognition models using limited training data (제한된 학습 데이터를 사용하는 End-to-End 음성 인식 모델)

  • Kim, June-Woo; Jung, Ho-Young
    • Phonetics and Speech Sciences, v.12 no.4, pp.63-71, 2020
  • Speech recognition is one of the areas actively commercialized using deep learning and machine learning techniques. However, the majority of speech recognition systems on the market are developed on data with limited speaker diversity and tend to perform well only on typical adult speakers, because most speech recognition models are trained on speech databases collected from adult males and females. This causes problems in recognizing the speech of the elderly, children, and dialect speakers. Solving these problems may require building a large database or collecting data for speaker adaptation. Instead, this paper proposes a new end-to-end speech recognition method consisting of an acoustically augmented recurrent encoder and a Transformer decoder with linguistic prediction. The proposed method can deliver reliable acoustic- and language-model performance under limited-data conditions. It was evaluated on Korean elderly and child speech with a limited amount of training data and showed better performance than a conventional method.
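
As a rough illustration of the described architecture (a recurrent encoder feeding a Transformer decoder), here is a minimal PyTorch skeleton. All dimensions are assumptions, and the paper's acoustic augmentation and linguistic-prediction components are omitted.

```python
# Sketch: recurrent encoder + Transformer decoder ASR skeleton in the
# spirit of the paper's design. Dimensions are assumptions; the acoustic
# augmentation and linguistic-prediction parts are not reproduced.
import torch
import torch.nn as nn

class RNNEncoderTransformerDecoder(nn.Module):
    def __init__(self, n_mels=80, d_model=256, vocab=2000):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, d_model // 2, num_layers=3,
                               bidirectional=True, batch_first=True)
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, feats, tokens):
        memory, _ = self.encoder(feats)                # (B, T, d_model)
        tgt = self.embed(tokens)                       # (B, U, d_model)
        u = tokens.size(1)                             # causal (autoregressive) mask
        mask = torch.triu(torch.full((u, u), float("-inf")), diagonal=1)
        return self.out(self.decoder(tgt, memory, tgt_mask=mask))

model = RNNEncoderTransformerDecoder()
logits = model(torch.randn(2, 120, 80), torch.randint(0, 2000, (2, 10)))
print(logits.shape)  # (2, 10, 2000)
```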

A Study on Design and Implementation of Speech Recognition System Using ART2 Algorithm

  • Kim, Joeng Hoon; Kim, Dong Han; Jang, Won Il; Lee, Sang Bae
    • International Journal of Fuzzy Logic and Intelligent Systems, v.4 no.2, pp.149-154, 2004
  • In this research, we selected speech recognition as the method for controlling an electric wheelchair by voice alone, and used DTW (Dynamic Time Warping), a speaker-dependent technique with a relatively high recognition rate among speech recognition methods. However, real-time operation demands a small memory footprint and fast processing. We therefore introduced VQ (Vector Quantization), widely used as a compression algorithm in speaker-independent recognition, to secure fast recognition with small memory. We found, however, that the recognition rate decreased after applying VQ. To recover it, we applied the ART2 (Adaptive Resonance Theory 2) algorithm as a post-processing step, obtaining about a 5% improvement in recognition rate. Utilizing ART2 requires an error range: when the difference between the second-best and best DTW distances is 20 or more, the error range is applied. With ART2 applied in this way, we obtained both fast processing and a high recognition rate. Moreover, since the system is mounted on a moving object, it had to be implemented as an embedded system. We therefore selected the TMS320C32 chip, which can perform a large number of calculations relatively quickly. Considering the memory required for speech data, we used 128 kbyte of RAM and 64 kbyte of ROM to store a large amount of data. For speech input, we used a 16-bit stereo audio codec, securing relatively accurate data through its high resolution.
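
A minimal Python sketch of the DTW matching and the margin rule follows. The 20-unit threshold comes from the abstract; whether post-processing fires inside or outside that margin is our reading of the (ambiguous) error-range rule, and the ART2 network itself is not reproduced.

```python
# Sketch: DTW template matching with a margin-based trigger for a
# post-processing stage (ART2 in the paper). The threshold of 20 follows
# the abstract; the trigger direction is our reading, not the paper's code.
import numpy as np

def dtw_distance(a, b):
    """Classic DTW between two feature sequences of shape (T1, D), (T2, D)."""
    t1, t2 = len(a), len(b)
    d = np.full((t1 + 1, t2 + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            d[i, j] = cost + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return d[t1, t2]

def recognize(query, templates, margin=20.0):
    """Pick the best template from (name, sequence) pairs; flag the result
    for post-processing when the two best distances are closer than `margin`."""
    dists = sorted((dtw_distance(query, t), name) for name, t in templates)
    best, second = dists[0], dists[1]
    needs_postprocess = (second[0] - best[0]) < margin
    return best[1], needs_postprocess

rng = np.random.default_rng(1)
templates = [("go", rng.standard_normal((30, 12))),
             ("stop", rng.standard_normal((25, 12)))]
print(recognize(rng.standard_normal((28, 12)), templates))
```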

Effective Recognition of Velopharyngeal Insufficiency (VPI) Patient's Speech Using DNN-HMM-based System (DNN-HMM 기반 시스템을 이용한 효과적인 구개인두부전증 환자 음성 인식)

  • Yoon, Ki-mu; Kim, Wooil
    • Journal of the Korea Institute of Information and Communication Engineering, v.23 no.1, pp.33-38, 2019
  • This paper proposes an effective method for recognizing VPI patients' speech using a DNN-HMM-based speech recognition system, and evaluates its performance against a GMM-HMM-based system. The proposed method employs speaker adaptation to improve VPI speech recognition. To make effective use of the small amount of VPI speech available for model adaptation, we propose using simulated VPI speech to generate a prior model for speaker adaptation, together with selective learning of the DNN's weight matrices. We also apply Linear Input Network (LIN)-based model adaptation to the DNN model. The proposed speaker adaptation method brings a 2.35% improvement in average accuracy over the GMM-HMM-based ASR system. The experimental results demonstrate that, compared to the conventional GMM-HMM system, the proposed DNN-HMM-based speech recognition system is effective for VPI speech with small-sized speech data.
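
LIN adaptation trains a speaker-specific linear transform on the input features while the pre-trained DNN stays frozen. A minimal PyTorch sketch, with assumed dimensions rather than the authors' configuration:

```python
# Sketch: Linear Input Network (LIN) adaptation. A speaker-specific linear
# transform is trained on a small adaptation set while the pre-trained
# acoustic model stays frozen. All dimensions here are assumptions.
import torch
import torch.nn as nn

feat_dim, n_states = 40, 1000
acoustic_model = nn.Sequential(          # stand-in for the pre-trained DNN
    nn.Linear(feat_dim, 512), nn.ReLU(),
    nn.Linear(512, n_states),
)
for p in acoustic_model.parameters():    # freeze the speaker-independent DNN
    p.requires_grad = False

lin = nn.Linear(feat_dim, feat_dim)      # LIN, initialized near identity
with torch.no_grad():
    lin.weight.copy_(torch.eye(feat_dim))
    lin.bias.zero_()

optimizer = torch.optim.SGD(lin.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One adaptation step on a (features, senone-label) batch from the speaker.
feats = torch.randn(32, feat_dim)
labels = torch.randint(0, n_states, (32,))
loss = criterion(acoustic_model(lin(feats)), labels)
loss.backward()
optimizer.step()
```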

A Study on Realization of Continuous Speech Recognition System of Speaker Adaptation (화자적응화 연속음성 인식 시스템의 구현에 관한 연구)

  • 김상범; 김수훈; 허강인; 고시영
    • The Journal of the Acoustical Society of Korea, v.18 no.3, pp.10-16, 1999
  • In this paper, we have studied a speaker-adaptive continuous speech recognition system using MAPE (Maximum A Posteriori Probability Estimation), which can adapt with even a small amount of adaptation speech data. Speaker adaptation is performed by MAPE after concatenation training, in which sentence-unit HMMs are built by linking syllable-unit HMMs, and Viterbi segmentation automatically classifies the adaptation speech data into syllable-unit segments without hand labeling. For car-control speech, the recognition rate of the adapted HMM was 77.18%, approximately a 6% improvement over the unadapted HMM (in the case of O(n)DP).
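
The standard MAP update for a Gaussian mean, which the described adaptation follows in spirit, interpolates between the prior (speaker-independent) mean and the adaptation-data statistics. A minimal numpy sketch with illustrative values:

```python
# Sketch: MAP adaptation of a Gaussian mean, in the standard form
#   mu_map = (tau * mu_prior + sum_t gamma_t x_t) / (tau + sum_t gamma_t),
# where gamma_t are state-occupancy weights from alignment (the paper
# obtains segments via Viterbi segmentation). tau is a prior-strength
# hyperparameter; all values below are illustrative.
import numpy as np

def map_adapt_mean(mu_prior, frames, gammas, tau=10.0):
    occ = gammas.sum()
    return (tau * mu_prior + (gammas[:, None] * frames).sum(axis=0)) / (tau + occ)

mu_prior = np.zeros(13)                       # speaker-independent mean
frames = np.random.randn(50, 13) + 0.5        # small adaptation sample
gammas = np.ones(50)                          # hard (Viterbi) occupancy
print(map_adapt_mean(mu_prior, frames, gammas))
```

With little adaptation data the estimate stays near the prior mean; as occupancy grows it moves toward the sample statistics, which is why MAP works with any small amount of adaptation speech.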

Robust Speech Recognition Using Missing Data Theory (손실 데이터 이론을 이용한 강인한 음성 인식)

  • 김락용; 조훈영; 오영환
    • The Journal of the Acoustical Society of Korea, v.20 no.3, pp.56-62, 2001
  • In this paper, we apply missing-data theory to speech recognition. It can be used to maintain high recognizer performance when data are missing. In general, a hidden Markov model (HMM) is used as the stochastic classifier for speech recognition, and acoustic events are represented by continuous probability density functions in a continuous-density HMM (CDHMM). Missing-data theory has the advantage of being easily applicable to this CDHMM. A marginalization method is used to process missing data because of its low complexity and ease of application to automatic speech recognition (ASR). Spectral subtraction is used to detect missing data: if the difference between the energy of speech and that of background noise is below a given threshold, the component is declared missing. We propose a new method that examines the reliability of detected missing data using the voicing probability, which identifies voiced frames; it is used to process missing data in voiced regions, which carry more redundant information than consonants. The experimental results showed that our method improves performance over a baseline system that uses spectral subtraction only. In a 452-word isolated-word recognition experiment, the proposed method using the voicing probability reduced the average word error rate by 12% in a typical noise situation.
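
The two ingredients described, a spectral-subtraction-style reliability mask and marginalization of the missing dimensions, can be sketched for a diagonal-covariance Gaussian as below; the threshold and values are illustrative, not the paper's settings.

```python
# Sketch: missing-data marginalization for a diagonal-covariance Gaussian.
# A band is marked missing when speech energy exceeds the noise estimate
# by less than a threshold; the likelihood is then computed over reliable
# bands only, since missing dimensions integrate out to 1.
import numpy as np

def reliability_mask(speech_db, noise_db, threshold_db=3.0):
    """True where the band is reliable (speech sufficiently above noise)."""
    return (speech_db - noise_db) >= threshold_db

def marginal_log_likelihood(x, mask, mean, var):
    """Diagonal Gaussian log-likelihood marginalized over missing bands."""
    x, mean, var = x[mask], mean[mask], var[mask]
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

speech = np.array([20.0, 8.0, 15.0, 5.0])    # per-band energies (dB)
noise = np.array([10.0, 7.0, 6.0, 9.0])      # background noise estimate
mask = reliability_mask(speech, noise)        # [True, False, True, False]
print(marginal_log_likelihood(speech, mask, np.full(4, 12.0), np.full(4, 9.0)))
```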

Effective speech recognition system for patients with Parkinson's disease (파킨슨병 환자에 대한 효과적인 음성인식 시스템)

  • Huiyong, Bak; Ryul, Kim; Sangmin, Lee
    • The Journal of the Acoustical Society of Korea, v.41 no.6, pp.655-661, 2022
  • Since speech impairment is prevalent in patients with Parkinson's disease (PD), speech recognition systems suited to these patients are needed. In this paper, we propose a speech recognition system that effectively recognizes the speech of patients with PD. The system is first pre-trained with the Globalformer using speech data from healthy people, and then fine-tuned using a relatively small amount of speech data from patients with PD. For this analysis, we used the speech dataset of healthy people built by AI Hub and that of patients with PD collected at Inha University Hospital. In the experiment, the proposed system recognized the speech of patients with PD with a Character Error Rate (CER) of 22.15%, a better result than other methods.
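
The reported metric, Character Error Rate, is the Levenshtein distance between reference and hypothesis character sequences divided by the reference length. A minimal sketch:

```python
# Sketch: Character Error Rate (CER), the metric reported above.
# CER = (substitutions + deletions + insertions) / reference length,
# computed here with a standard Levenshtein dynamic program.
def cer(reference: str, hypothesis: str) -> float:
    r, h = list(reference), list(hypothesis)
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                     # deleting i reference characters
    for j in range(len(h) + 1):
        d[0][j] = j                     # inserting j hypothesis characters
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

print(cer("음성인식", "음성임식"))  # 1 substitution over 4 chars -> 0.25
```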

Implementation of Vocabulary-Independent Speech Recognizer Using a DSP (DSP를 이용한 가변어휘 음성인식기 구현에 관한 연구)

  • Chung, Ik-Joo
    • Speech Sciences, v.11 no.3, pp.143-156, 2004
  • In this paper, we implemented a vocabulary-independent speech recognizer using the TMS320VC33 DSP. For this implementation, we developed a very small recognition engine based on diphone sub-word units, which is especially suited to embedded applications where system resources are severely limited. The recognition accuracy of the developed recognizer, with 1 mixture per state and 4 states per diphone, is 94.5% when tested on a set of 2,000 frequently used words. The hardware design focused on minimal use of parts, reducing material cost; the final hardware includes only a DSP, 512 Kword of flash ROM, and a voice codec. In porting the recognition engine to the DSP, we introduced several methods for using data and program memory efficiently and developed a versatile software protocol for the host interface. Finally, we also made an evaluation board for testing the developed hardware recognition module.
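
Vocabulary independence comes from the diphone sub-word inventory: any new word can be assembled by concatenating already-trained diphone models. A toy sketch of the decomposition, with hypothetical phone labels and an assumed silence-padding convention:

```python
# Sketch: decomposing a word's phone sequence into diphone units, the
# sub-word inventory that makes the recognizer vocabulary-independent.
# Phone labels and the silence-padding convention are assumptions.
def to_diphones(phones):
    padded = ["sil"] + phones + ["sil"]        # pad with silence context
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

# A new vocabulary word only needs its phone sequence; its model is the
# concatenation of already-trained diphone HMMs.
print(to_diphones(["k", "a", "m"]))  # ['sil-k', 'k-a', 'a-m', 'm-sil']
```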

N-gram Adaptation Using Information Retrieval and Dynamic Interpolation Coefficient (정보검색 기법과 동적 보간 계수를 이용한 N-gram 언어모델의 적응)

  • Choi, Joon Ki; Oh, Yung-Hwan
    • MALSORI, no.56, pp.207-223, 2005
  • The goal of language model adaptation is to improve the background language model with a relatively small adaptation corpus. This study presents a language model adaptation technique for the case where no additional text data for adaptation exist. We propose using information retrieval (IR) with N-gram language modeling to collect the adaptation corpus from the baseline text data. We also propose using a dynamic language model interpolation coefficient to combine the background language model and the adapted language model. The interpolation coefficient is estimated from the word hypotheses obtained by segmenting the input speech data reserved as held-out validation data. This allows the final adapted model to consistently improve on the background model. The proposed approach reduces the word error rate by 13.6% relative to the baseline 4-gram for two-hour broadcast news speech recognition.
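
Linear interpolation with a dynamically estimated coefficient can be sketched as below, using the standard EM update for a two-model mixture weight on held-out hypotheses; the probability functions are hypothetical stand-ins, not the paper's models.

```python
# Sketch: dynamic linear interpolation of two language models,
#   P(w|h) = lam * P_adapt(w|h) + (1 - lam) * P_background(w|h),
# with lam estimated by the standard EM update on held-out hypotheses.
# The probability functions below are toy stand-ins.
def estimate_lambda(heldout, p_adapt, p_background, iters=20):
    lam = 0.5
    for _ in range(iters):
        posts = []
        for history, word in heldout:
            a = lam * p_adapt(word, history)
            b = (1 - lam) * p_background(word, history)
            posts.append(a / (a + b))          # posterior of the adapted LM
        lam = sum(posts) / len(posts)          # re-estimate the mixture weight
    return lam

def interp_prob(word, history, lam, p_adapt, p_background):
    return lam * p_adapt(word, history) + (1 - lam) * p_background(word, history)

# Toy stand-ins: constant-probability "models" just to exercise the code.
pa = lambda w, h: 0.02
pb = lambda w, h: 0.01
print(estimate_lambda([("h", "w")] * 10, pa, pb))  # drifts toward the better LM
```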

Exploring the feasibility of fine-tuning large-scale speech recognition models for domain-specific applications: A case study on Whisper model and KsponSpeech dataset

  • Jungwon Chang; Hosung Nam
    • Phonetics and Speech Sciences, v.15 no.3, pp.83-88, 2023
  • This study investigates the fine-tuning of large-scale Automatic Speech Recognition (ASR) models, specifically OpenAI's Whisper model, for domain-specific applications using the KsponSpeech dataset. The primary research questions address the effectiveness of targeted lexical item emphasis during fine-tuning, its impact on domain-specific performance, and whether the fine-tuned model can maintain generalization capabilities across different languages and environments. Experiments were conducted using two fine-tuning datasets: Set A, a small subset emphasizing specific lexical items, and Set B, consisting of the entire KsponSpeech dataset. Results showed that fine-tuning with targeted lexical items increased recognition accuracy and improved domain-specific performance, with generalization capabilities maintained when fine-tuned with a smaller dataset. For noisier environments, a trade-off between specificity and generalization capabilities was observed. This study highlights the potential of fine-tuning using minimal domain-specific data to achieve satisfactory results, emphasizing the importance of balancing specialization and generalization for ASR models. Future research could explore different fine-tuning strategies and novel technologies such as prompting to further enhance large-scale ASR models' domain-specific performance.
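
For context, a single fine-tuning step for Whisper with the Hugging Face transformers library looks roughly like this; the model size, dummy audio, transcript, and optimizer settings are assumptions, not the paper's setup.

```python
# Sketch: one fine-tuning step for Whisper via Hugging Face transformers.
# Model size, audio, transcript, and optimizer settings are assumptions;
# the paper's KsponSpeech preprocessing and training schedule are omitted.
import numpy as np
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

audio = np.random.randn(16000 * 5).astype(np.float32)   # 5 s of dummy audio
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer("예시 전사 텍스트", return_tensors="pt").input_ids

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
outputs = model(input_features=inputs.input_features, labels=labels)
outputs.loss.backward()                                  # seq2seq CE loss
optimizer.step()
```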