• Title/Summary/Keyword: Word error rate

Search Result 125, Processing Time 0.04 seconds

A Study on Speaker Normalization using VTN (VTN을 이용한 화자 정규화에 관한 연구)

  • 손창희;손종목;배건성
    • Proceedings of the IEEK Conference
    • /
    • 2001.09a
    • /
    • pp.499-502
    • /
    • 2001
  • 본 연구에서는 화자에 따라 서로 다른 성도의 길이에 의해 발생하는 음성인식 시스템의 성능 저하를 줄이기 위하여, VTN(Vocal Tract Normalization)을 음성인식 시스템에 적용하고, 주소 인식 실험을 통하여 인식 성능을 평가하였다. 또, VTN을 CMN과 동시에 적용하여 인식 실험을 하였다. 실험에서는 화자간 성도길이의 차이를 반영하기 위하여 13개의 Warping 계수에 대해 필터 뱅크를 이용한 선형 Warping 방법을 적용하였다. 실험결과, Baseline 인식 시스템에 비하여 VTN을 적용하면, WER(Word Error Rate)이 1.24% 감소하였고, CMN과 VTN을 동시에 적용한 실험에서는 Baseline 인식 시스템과 비교하여 WER이 0.33% 감소 하였지만 VTN을 적용한 실험결과와 비교하면 오히려 0.91% 증가하였다.

  • PDF

Maximum Likelihood Training and Adaptation of Embedded Speech Recognizers for Mobile Environments

  • Cho, Young-Kyu;Yook, Dong-Suk
    • ETRI Journal
    • /
    • v.32 no.1
    • /
    • pp.160-162
    • /
    • 2010
  • For the acoustic models of embedded speech recognition systems, hidden Markov models (HMMs) are usually quantized and the original full space distributions are represented by combinations of a few quantized distribution prototypes. We propose a maximum likelihood objective function to train the quantized distribution prototypes. The experimental results show that the new training algorithm and the link structure adaptation scheme for the quantized HMMs reduce the word recognition error rate by 20.0%.

Decoder Design of a Nonbinary Code in the System with a High Code Rate (코드 레이트가 높은 시스템에 있어서의 비이진코드의 디코더 설계)

  • 정일석;강창언
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.11 no.1
    • /
    • pp.53-63
    • /
    • 1986
  • In this paper the decoder of nonbinary code satisfying R>1/t has been designed and constructed, where R is the code rate and t is the error correcting capability. In order to design the error trapping decoder, the concept of covering monomial is used and them the decoder system using the (15, 11) Reed-Solomon code is implemented. Without Galois Fiedl multiplication and division circuits, the decoder system is simply constructed. In the decoding process, it takes 60clocks to decode one code word. Two symbol errors and eight binary burst errors are simultaneously corrected. This coding system is shown to be efficient when the channel error probability is approximately from $5{\times}10^-4$~$5{\times}10^-5$.

  • PDF

Orthographic and phonological links in Korean lexical processing (한국어 어휘 처리 과정에서 글짜 정보와 발음 정보의 연결성)

  • Kim, Jee-Sun;Taft, Marcus
    • Annual Conference on Human and Language Technology
    • /
    • 1995.10a
    • /
    • pp.211-214
    • /
    • 1995
  • At what level of orthographic representation is phonology linked in thelexicon? Is it at the whole word level, the syllable level, letter level, etc? This question can be addressed by comparing the two scripts used in Korean, logographic Hanmoon and alphabetic/syllabic Hangul, on a task where judgements must be made about the phonology of a visually presented word. Four experiments are reported using a "homophone decision task" and manipulating the sub-lexical relationship between orthography and phonology in Hanmoon and Hangul, and the lexical status of the stimuli. Hangul words showed a much higher error rate in judging whether there was another word identically pronounced than both Hangul nonwords and Hanmoon words. It is concluded that the relationship between orthography and phonology in the lexicon differs according tn the type of script owing to the availability of sub-lexical information: the process of making a homophone derision is based on a spread of activation exclusively among lexical entries, from orthography to phonology and vice versa (called "Orthography-Phonology-Orthography Rebound" or "OPO Rebound"). The results are explained within the mulitilevel interactive activation model with orthographic units linked to phonological units at each level.

  • PDF

Rapid Speaker Adaptation Based on Eigenvoice Using Weight Distribution Characteristics (가중치 분포 특성을 이용한 Eigenvoice 기반 고속화자적응)

  • 박종세;김형순;송화전
    • The Journal of the Acoustical Society of Korea
    • /
    • v.22 no.5
    • /
    • pp.403-407
    • /
    • 2003
  • Recently, eigenvoice approach has been widely used for rapid speaker adaptation. However, even in the eigenvoice approach, Performance improvement using very small amount of adaptation data is relatively small in comparison with that using somewhat large adaptation data because the reliable estimation of weights of eigenvoice is difficult. In this paper, we propose a rapid speaker adaptation method based on eigenvoice using the weight distribution characteristics to improve the performance on a small adaptation data. In the Experimental results on vocabulary-independent word recognition task (using PBW 452 database), the weight threshold method alleviates the problem of relatively low performance for a tiny small adaptation data. When single adaptation word is used, word error rate is reduced about 9-18% by the weight threshold method.

Robust Speech Recognition Parameters for Emotional Variation (감정 변화에 강인한 음성 인식 파라메터)

  • Kim Weon-Goo
    • Journal of the Korean Institute of Intelligent Systems
    • /
    • v.15 no.6
    • /
    • pp.655-660
    • /
    • 2005
  • This paper studied the feature parameters less affected by the emotional variation for the development of the robust speech recognition technologies. For this purpose, the effect of emotional variation on the speech recognition system and robust feature parameters of speech recognition system were studied using speech database containing various emotions. In this study, LPC cepstral coefficient, met-cepstral coefficient, root-cepstral coefficient, PLP coefficient, RASTA met-cepstral coefficient were used as a feature parameters. And CMS and SBR method were used as a signal bias removal techniques. Experimental results showed that the HMM based speaker independent word recognizer using RASTA met-cepstral coefficient :md its derivatives and CMS as a signal bias removal showed the best performance of $7.05\%$ word error rate. This corresponds to about a $52\%$ word error reduction as compare to the performance of baseline system using met - cepstral coefficient.

Multimodal audiovisual speech recognition architecture using a three-feature multi-fusion method for noise-robust systems

  • Sanghun Jeon;Jieun Lee;Dohyeon Yeo;Yong-Ju Lee;SeungJun Kim
    • ETRI Journal
    • /
    • v.46 no.1
    • /
    • pp.22-34
    • /
    • 2024
  • Exposure to varied noisy environments impairs the recognition performance of artificial intelligence-based speech recognition technologies. Degraded-performance services can be utilized as limited systems that assure good performance in certain environments, but impair the general quality of speech recognition services. This study introduces an audiovisual speech recognition (AVSR) model robust to various noise settings, mimicking human dialogue recognition elements. The model converts word embeddings and log-Mel spectrograms into feature vectors for audio recognition. A dense spatial-temporal convolutional neural network model extracts features from log-Mel spectrograms, transformed for visual-based recognition. This approach exhibits improved aural and visual recognition capabilities. We assess the signal-to-noise ratio in nine synthesized noise environments, with the proposed model exhibiting lower average error rates. The error rate for the AVSR model using a three-feature multi-fusion method is 1.711%, compared to the general 3.939% rate. This model is applicable in noise-affected environments owing to its enhanced stability and recognition rate.

Pronunciation Variation Modeling for Korean Point-of-Interest Data Using Prosodic Information (운율 정보를 이용한 한국어 위치 정보 데이타의 발음 모델링)

  • Kim, Sun-He;Park, Jeon-Gue;Na, Min-Soo;Jeon, Je-Hun;Chung, Min-Wha
    • Journal of KIISE:Software and Applications
    • /
    • v.34 no.2
    • /
    • pp.104-111
    • /
    • 2007
  • This paper examines how the performance of an automatic speech recognizer was improved for Korean Point-of-Interest (POI) data by modeling pronunciation variation using structural prosodic information such as prosodic words and syllable length. First, multiple pronunciation variants are generated using prosodic words given that each POI word can be broken down into prosodic words. And the cross-prosodic-word variations were modeled considering the syllable length of word. A total of 81 experiments were conducted using 9 test sets (3 baseline and 6 proposed) on 9 trained sets (3 baseline, 6 proposed). The results show: (i) the performance was improved when the pronunciation lexica were generated using prosodic words; (ii) the best performance was achieved when the maximum number of variants was constrained to 3 based on the syllable length; and (iii) compared to the baseline word error rate (WER) of 4.63%, a maximum of 8.4% in WER reduction was achieved when both prosodic words and syllable length were considered.

A Spelling Error Correction Model in Korean Using a Correction Dictionary and a Newspaper Corpus (교정사전과 신문기사 말뭉치를 이용한 한국어 철자 오류 교정 모델)

  • Lee, Se-Hee;Kim, Hark-Soo
    • The KIPS Transactions:PartB
    • /
    • v.16B no.5
    • /
    • pp.427-434
    • /
    • 2009
  • With the rapid evolution of the Internet and mobile environments, text including spelling errors such as newly-coined words and abbreviated words are widely used. These spelling errors make it difficult to develop NLP (natural language processing) applications because they decrease the readability of texts. To resolve this problem, we propose a spelling error correction model using a spelling error correction dictionary and a newspaper corpus. The proposed model has the advantage that the cost of data construction are not high because it uses a newspaper corpus, which we can easily obtain, as a training corpus. In addition, the proposed model has an advantage that additional external modules such as a morphological analyzer and a word-spacing error correction system are not required because it uses a simple string matching method based on a correction dictionary. In the experiments with a newspaper corpus and a short message corpus collected from real mobile phones, the proposed model has been shown good performances (a miss-correction rate of 7.3%, a F1-measure of 97.3%, and a false positive rate of 1.1%) in the various evaluation measures.

Speech Recognition Accuracy Prediction Using Speech Quality Measure (음성 특성 지표를 이용한 음성 인식 성능 예측)

  • Ji, Seung-eun;Kim, Wooil
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.20 no.3
    • /
    • pp.471-476
    • /
    • 2016
  • This paper presents our study on speech recognition performance prediction. Our initial study shows that a combination of speech quality measures effectively improves correlation with Word Error Rate (WER) compared to each speech measure alone. In this paper we demonstrate a new combination of various types of speech quality measures shows more significantly improves correlation with WER compared to the speech measure combination of our initial study. In our study, SNR, PESQ, acoustic model score, and MFCC distance are used as the speech quality measures. This paper also presents our speech database verification system for speech recognition employing the speech measures. We develop a WER prediction system using Gaussian mixture model and the speech quality measures as a feature vector. The experimental results show the proposed system is highly effective at predicting WER in a low SNR condition of speech babble and car noise environments.