• Title/Summary/Keyword: robust speech recognition

Search Result 224, Processing Time 0.031 seconds

A Land and Maritime Unified Tourism Information Guide System Based on Robust Speech Recognition in Ship Noise Environments (선박 잡음 환경에서의 강건한 음성 인식 기반 육해상 통합 관광 정보 안내 시스템)

  • Jeon, Kwang Myung;Lee, Jang Won;Park, Ji Hun;Lee, Seong Ro;Lee, Yeonwoo;Maeng, Se Young;Kim, Hong Kook
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.38C no.2
    • /
    • pp.189-195
    • /
    • 2013
  • In this paper, a land and maritime unified tourism information guide system is proposed which employs robust speech recognition in ship noise environments. Most of conventional front-ends for speech recognition have used a Wiener filter to compensate for stationary noise such as car or babble noises. However, such the conventional front-ends have limitation in reducing non-stationary noise that are occurred inside the ship on voyage. To overcome such a limitation, the proposed system incorporates nonlinear multi-band spectral subtraction to provide highly accurate tourism route recognition. It is shown from the experiment that compared to a conventional system the proposed system achieves relative improvement of a tourism route recognition rate by 5.54% under a noise condition of 10 dB signal-to-noise ratio (SNR).

Building robust Korean speech recognition model by fine-tuning large pretrained model (대형 사전훈련 모델의 파인튜닝을 통한 강건한 한국어 음성인식 모델 구축)

  • Changhan Oh;Cheongbin Kim;Kiyoung Park
    • Phonetics and Speech Sciences
    • /
    • v.15 no.3
    • /
    • pp.75-82
    • /
    • 2023
  • Automatic speech recognition (ASR) has been revolutionized with deep learning-based approaches, among which self-supervised learning methods have proven to be particularly effective. In this study, we aim to enhance the performance of OpenAI's Whisper model, a multilingual ASR system on the Korean language. Whisper was pretrained on a large corpus (around 680,000 hours) of web speech data and has demonstrated strong recognition performance for major languages. However, it faces challenges in recognizing languages such as Korean, which is not major language while training. We address this issue by fine-tuning the Whisper model with an additional dataset comprising about 1,000 hours of Korean speech. We also compare its performance against a Transformer model that was trained from scratch using the same dataset. Our results indicate that fine-tuning the Whisper model significantly improved its Korean speech recognition capabilities in terms of character error rate (CER). Specifically, the performance improved with increasing model size. However, the Whisper model's performance on English deteriorated post fine-tuning, emphasizing the need for further research to develop robust multilingual models. Our study demonstrates the potential of utilizing a fine-tuned Whisper model for Korean ASR applications. Future work will focus on multilingual recognition and optimization for real-time inference.

Creation and Assessment of Korean Speech and Noise DB in Car Environments (자동차 환경에서의 노이즈 DB 및 한국어 음성 DB 구축)

  • Lee Kwang-Hyun;Kim Bong-Wan;Lee Yong-Ju
    • MALSORI
    • /
    • no.48
    • /
    • pp.141-153
    • /
    • 2003
  • Researches into robust recognition in noise environments, especially in car environments, are being carried out actively in speech community. In this paper we will report on three types of corpora that SiTEC (Speech Information TEchnology & industry promotion Center) has created for research into speech recognition in car noise environments. The first is the recordings of 900 Korean native speakers, distributed according to gender, age, and region, who uttered application words in car environments. The second is the collections of mixed noise in 3 car types by model while setting up various noise patterns which can be obtained with the car engine on or off, at different driving speed, and in different road conditions with windows open or closed. The third is the recordings of simulated speech by HATS (Head and Torso Simulator) in car environments with the internal and external noise factors added. These three types of recordings were all made through synchronized 8 channel microphones that are fixed in a car. The creation and applications of these corpora will be reported on in detail.

  • PDF

Performance Improvement ofSpeech Recognition Based on SPLICEin Noisy Environments (SPLICE 방법에 기반한 잡음 환경에서의 음성 인식 성능 향상)

  • Kim, Jong-Hyeon;Song, Hwa-Jeon;Lee, Jong-Seok;Kim, Hyung-Soon
    • MALSORI
    • /
    • no.53
    • /
    • pp.103-118
    • /
    • 2005
  • The performance of speech recognition system is degraded by mismatch between training and test environments. Recently, Stereo-based Piecewise LInear Compensation for Environments (SPLICE) was introduced to overcome environmental mismatch using stereo data. In this paper, we propose several methods to improve the conventional SPLICE and evaluate them in the Aurora2 task. We generalize SPLICE to compensate for covariance matrix as well as mean vector in the feature space, and thereby yielding the error rate reduction of 48.93%. We also employ the weighted sum of correction vectors using posterior probabilities of all Gaussians, and the error rate reduction of 48.62% is achieved. With the combination of the above two methods, the error rate is reduced by 49.61% from the Aurora2 baseline system.

  • PDF

Robust Speech Recognition Using Independent Component Analysis (독립성분분석을 이용한 강인한 음성인식)

  • 임형규;이창기
    • Journal of the Korea Computer Industry Society
    • /
    • v.5 no.2
    • /
    • pp.269-274
    • /
    • 2004
  • Noisy speech recognition is one of most important problems in speech recognition. In this paper, a method which efficiently removes the mixed noise with speech, is proposed. The proposed method is based on the ICA to separate the mixed noise. ICA(Independent component analysis) is a signal processing technique, whose goal is to express a set of random variables as linear combinations of components that are statistically as independent from each other as possible.

  • PDF

Enhancement of Rejection Performance using the PSO-NCM in Noisy Environment (잡음 환경하에서의 PSO-NCM을 이용한 거절기능 성능 향상)

  • Kim, Byoung-Don;Song, Min-Gyu;Choi, Seung-Ho;Kim, Jin-Young
    • Speech Sciences
    • /
    • v.15 no.4
    • /
    • pp.85-96
    • /
    • 2008
  • Automatic speech recognition has severe performance degradation under noisy environments. To cope with the noise problem, many methods have been proposed. Most of them focused on noise-robust features or model adaptation. However, researchers have overlooked utterance verification (UV) under noisy environments. In this paper we discuss UV problems based on the normalized confidence measure. First, we show that UV performance is also degraded in noisy environments with the experiments of an isolated word recognition. Then we observe how the degradation of UV performances is suffered. Based on the UV experiments we propose a modeling method of the statistics of phone confidences using sigmoid functions. For obtaining the parameters of the sigmoidal models, the particle swarm optimization (PSO) is adopted. The proposed method improves 20% rejection performance. Our experimental results show that the PSO-NCM can apply noise speech recognition successfully.

  • PDF

Frame Reliability Weighting for Robust Speech Recognition (프레임 신뢰도 가중에 의한 강인한 음성인식)

  • 조훈영;김락용;오영환
    • The Journal of the Acoustical Society of Korea
    • /
    • v.21 no.3
    • /
    • pp.323-329
    • /
    • 2002
  • This paper proposes a frame reliability weighting method to compensate for a time-selective noise that occurs at random positions of speech signal contaminating certain parts of the speech signal. Speech frames have different degrees of reliability and the reliability is proportional to SNR (signal-to noise ratio). While it is feasible to estimate frame Sl? by using the noise information from non-speech interval under a stationary noisy situation, it is difficult to obtain noise spectrum for a time-selective noise. Therefore, we used statistical models of clean speech for the estimation of the frame reliability. The proposed MFR (model-based frame reliability) approximates frame SNR values using filterbank energy vectors that are obtained by the inverse transformation of input MFCC (mal-frequency cepstral coefficient) vectors and mean vectors of a reference model. Experiments on various burnt noises revealed that the proposed method could represent the frame reliability effectively. We could improve the recognition performance by using MFR values as weighting factors at the likelihood calculation step.

Recognition Performance Improvement of Unsupervised Limabeam Algorithm using Post Filtering Technique

  • Nguyen, Dinh Cuong;Choi, Suk-Nam;Chung, Hyun-Yeol
    • IEMEK Journal of Embedded Systems and Applications
    • /
    • v.8 no.4
    • /
    • pp.185-194
    • /
    • 2013
  • Abstract- In distant-talking environments, speech recognition performance degrades significantly due to noise and reverberation. Recent work of Michael L. Selzer shows that in microphone array speech recognition, the word error rate can be significantly reduced by adapting the beamformer weights to generate a sequence of features which maximizes the likelihood of the correct hypothesis. In this approach, called Likelihood Maximizing Beamforming algorithm (Limabeam), one of the method to implement this Limabeam is an UnSupervised Limabeam(USL) that can improve recognition performance in any situation of environment. From our investigation for this USL, we could see that because the performance of optimization depends strongly on the transcription output of the first recognition step, the output become unstable and this may lead lower performance. In order to improve recognition performance of USL, some post-filter techniques can be employed to obtain more correct transcription output of the first step. In this work, as a post-filtering technique for first recognition step of USL, we propose to add a Wiener-Filter combined with Feature Weighted Malahanobis Distance to improve recognition performance. We also suggest an alternative way to implement Limabeam algorithm for Hidden Markov Network (HM-Net) speech recognizer for efficient implementation. Speech recognition experiments performed in real distant-talking environment confirm the efficacy of Limabeam algorithm in HM-Net speech recognition system and also confirm the improved performance by the proposed method.

On-line model compensation using noise masking effect for robust speech recognition (잡음 차폐를 이용한 온라인 모델 보상)

  • Jung Gue-Jun;Cho Hoon-Young;Oh Yung-Hwan
    • Proceedings of the KSPS conference
    • /
    • 2003.05a
    • /
    • pp.215-218
    • /
    • 2003
  • In this paper we apply PMC (parallel model combination) to speech recognition system online. As a representative of model based noise compensation techniques, PMC compensates environmental mismatch by combining pretrained clean speech models and real-time estimated noise information. This is very effective approach for compensating extreme environmental mismatch but is inadequate to use in on-line system for heavy computational cost. To reduce the computational cost and to apply PMC online, we use a noise masking effect - the energy in a frequency band is dominated either by clean speech energy or by noise energy - in the process of model compensation. Experiments on artificially produced noisy speech data confirm that the proposed technique is fast and effective for the on-line model compensation.

  • PDF

Combination Tandem Architecture with Segmental Features for Robust Speech Recognition (강인한 음성 인식을 위한 탠덤 구조와 분절 특징의 결합)

  • Yun, Young-Sun;Lee, Yun-Keun
    • MALSORI
    • /
    • no.62
    • /
    • pp.113-131
    • /
    • 2007
  • It is reported that the segmental feature based recognition system shows better results than conventional feature based system in the previous studies. On the other hand, the various studies of combining neural network and hidden Markov models within a single system are done with expectations that it may potentially combine the advantages of both systems. With the influence of these studies, tandem approach was presented to use neural network as the classifier and hidden Markov models as the decoder. In this paper, we applied the trend information of segmental features to tandem architecture and used posterior probabilities, which are the output of neural network, as inputs of recognition system. The experiments are performed on Auroral database to examine the potentiality of the trend feature based tandem architecture. From the results, the proposed system outperforms on very low SNR environments. Consequently, we argue that the trend information on tandem architecture can be additionally used for traditional MFCC features.

  • PDF