• Title/Summary/Keyword: phoneme HMM

A Study on Recognition Units and Methods to Align Training Data for Korean Speech Recognition (한국어 인식을 위한 인식 단위와 학습 데이터 분류 방법에 대한 연구)

  • 황영수
    • Journal of the Institute of Convergence Signal Processing / v.4 no.2 / pp.40-45 / 2003
  • This study addresses recognition units and phoneme segmentation. When building a large-vocabulary speech recognition system, it is better to use the segment than the syllable or the word as the recognition unit. In this paper, we study suitable recognition units and phoneme segmentation for Korean speech recognition. For the experiments, we use the OGI speech toolkit from the U.S.A. The results show that the recognition rate is higher when a diphthong is modeled as a single unit than when it is modeled as two units, i.e., a glide plus a vowel. A recognizer trained on manually aligned data is slightly better than one trained on automatically aligned data. The recognition rate is also better when the biphone is used as the recognition unit than when the monophone is used.

Improvement of Naturalness for a HMM-based Korean TTS using the prosodic boundary information (운율경계정보를 이용한 HMM기반 한국어 TTS 자연성 향상 연구)

  • Lim, Gi-Jeong;Lee, Jung-Chul
    • Journal of the Korea Society of Computer and Information / v.17 no.9 / pp.75-84 / 2012
  • HMM-based text-to-speech systems generally use context-dependent triphone units from a large speech corpus to enhance the synthetic speech. To downsize a large speech corpus, acoustically similar triphone units are clustered with a decision tree that uses context-dependent information. This information includes the phoneme sequence as well as prosodic information, because the naturalness of synthetic speech depends heavily on prosody such as pauses, intonation patterns, and segmental duration. However, if the prosodic information is complicated, many context-dependent phonemes have no examples in the training data, and clustering yields smoothed features that generate unnatural synthetic speech. In this paper, instead of complicated prosodic information, we propose three simple prosodic boundary types and decision tree questions that use rising tone, falling tone, and monotonic tone to improve naturalness. Experimental results show that the proposed method improves the naturalness of an HMM-based Korean TTS system and achieves a high MOS in the perception test.

A Study on Spoken Digits Analysis and Recognition (숫자음 분석과 인식에 관한 연구)

  • 김득수;황철준
    • Journal of Korea Society of Industrial Information Systems / v.6 no.3 / pp.107-114 / 2001
  • This paper describes connected digit recognition that takes acoustic features of Korean into account. The recognition rate for connected digits is usually lower than for word recognition. Therefore, speech feature parameters and acoustic features are employed to build robust digit models, and we confirmed the effect of considering acoustic features through recognition experiments. We used the KLE 4 connected digit database and 19 continuous-distribution HMMs as PLUs (Phoneme-Like Units) defined using phonetic rules. We tested two cases in the recognition experiments. In the first case, we used the usual method, with Mel-Cepstrum and Regressive Coefficient, to construct the phoneme models. In the second case, we used expanded feature parameters and acoustic features to construct the phoneme models. In both cases, we employed OPDP (One Pass Dynamic Programming) and FSA (Finite State Automata) for the recognition tests. When applying the FSN for recognition, we applied various acoustic features. As a result, we obtained a recognition rate of 55.4% for Mel-Cepstrum and 67.4% for Mel-Cepstrum with Regressive Coefficient. We also obtained 74.3% for the expanded feature parameters and 75.4% when applying acoustic features. Since applying acoustic features gave better results than the former method, we conclude that the proposed method is effective for connected digit recognition in Korean.

A Recognition Time Reduction Algorithm for Large-Vocabulary Speech Recognition (대용량 음성인식을 위한 인식기간 감축 알고리즘)

  • Koo, Jun-Mo;Un, Chong-Kwan
    • The Journal of the Acoustical Society of Korea / v.10 no.3 / pp.31-36 / 1991
  • We propose an efficient pre-classification algorithm that extracts candidate words to reduce the recognition time in a large-vocabulary recognition system, and we also propose spectral and temporal smoothing of the observation probability to improve its classification performance. The proposed algorithm computes a coarse likelihood score for each word in the lexicon using the observation probabilities of speech spectra and the duration information of recognition units. With the proposed approach, we reduced the amount of computation by 74% with a slight degradation of recognition accuracy in a 1160-word recognition system based on phoneme-level HMMs. We also observed that the proposed coarse likelihood score is a good estimator of the likelihood score computed by the Viterbi algorithm.
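
For orientation, the reference quantity mentioned in the abstract above, the likelihood score computed by the Viterbi algorithm, can be sketched as follows. This is only a minimal illustration of a best-path log score for a small HMM; it is not the paper's coarse pre-classification score, whose details are not given here. The function names, the two-state model, and all probabilities are hypothetical values chosen for the example.

```python
import math

NEG_INF = float("-inf")


def to_log(prob):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(prob) if prob > 0.0 else NEG_INF


def viterbi_log_score(log_init, log_trans, log_obs):
    """Return the log-likelihood of the single best state path.

    log_init[s]      log initial probability of state s
    log_trans[p][s]  log transition probability from state p to state s
    log_obs[t][s]    log observation probability of frame t in state s
    """
    num_states = len(log_init)
    # Initialise with the first frame.
    score = [log_init[s] + log_obs[0][s] for s in range(num_states)]
    # For each later frame keep only the best predecessor per state.
    for frame in log_obs[1:]:
        score = [
            max(score[p] + log_trans[p][s] for p in range(num_states)) + frame[s]
            for s in range(num_states)
        ]
    return max(score)


# Hypothetical two-state left-to-right model and three observation frames.
init = [to_log(x) for x in (1.0, 0.0)]
trans = [[to_log(x) for x in row] for row in ((0.5, 0.5), (0.0, 1.0))]
obs = [[to_log(x) for x in frame] for frame in ((0.8, 0.1), (0.4, 0.6), (0.1, 0.9))]
best = viterbi_log_score(init, trans, obs)
```

In this toy case the best path stays in state 0 for one frame and then moves to state 1, so the score is the log of 0.8 x 0.5 x 0.6 x 1.0 x 0.9. A coarse estimator such as the one the abstract proposes would aim to approximate this value at lower cost.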

Speaker Adaptation Using Linear Transformation Network in Speech Recognition (선형 변환망을 이용한 화자적응 음성인식)

  • 이기희
    • Journal of the Korea Society of Computer and Information / v.5 no.2 / pp.90-97 / 2000
  • This paper describes a speaker-adaptive speech recognition system that reliably recognizes speech from new speakers. In the proposed method, the speech spectrum of a new speaker is adapted to the reference speech spectrum using the parameters of a 1st linear transformation network placed in front of the phoneme classification neural network. The recognition system is based on a semicontinuous HMM (hidden Markov model) that uses a multilayer perceptron as a fuzzy vector quantizer. Isolated word recognition experiments are performed to measure the recognition rate of the system. With speaker adaptation, the recognition rate improves significantly over the unadapted recognition system.

Acoustic Modeling and Energy-Based Postprocessing for Automatic Speech Segmentation (자동 음성 분할을 위한 음향 모델링 및 에너지 기반 후처리)

  • Park Hyeyoung;Kim Hyungsoon
    • MALSORI / no.43 / pp.137-150 / 2002
  • Speech segmentation at the phoneme level is important for corpus-based text-to-speech synthesis. In this paper, we examine acoustic modeling methods to improve the performance of an automatic speech segmentation system based on the Hidden Markov Model (HMM). We compare monophone and triphone models and evaluate several model training approaches. In addition, we employ an energy-based postprocessing scheme to correct frequent boundary location errors between silence and speech sounds. Experimental results show that our system places 71.3% and 84.2% of boundaries correctly given tolerances of 10 ms and 20 ms, respectively.
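
The tolerance-based figures in the abstract above can be read as the share of automatically placed boundaries that fall within a fixed distance of the reference boundaries. The sketch below shows one way such a metric might be computed, assuming the reference and predicted boundaries are already paired one-to-one (as they would be when both segmentations use the same phoneme sequence). The function name and the boundary times are hypothetical; the paper's exact scoring procedure is not stated here.

```python
def boundary_accuracy(reference_ms, predicted_ms, tolerance_ms):
    """Fraction of predicted boundaries within tolerance_ms of their reference."""
    if len(reference_ms) != len(predicted_ms):
        raise ValueError("boundary lists must be paired one-to-one")
    if not reference_ms:
        raise ValueError("at least one boundary is required")
    hits = sum(
        1
        for ref, pred in zip(reference_ms, predicted_ms)
        if abs(ref - pred) <= tolerance_ms
    )
    return hits / len(reference_ms)


# Hypothetical boundary times in milliseconds (errors: 5, 12, 5, 30 ms).
reference = [120.0, 250.0, 400.0, 610.0]
predicted = [125.0, 262.0, 395.0, 640.0]
acc_10ms = boundary_accuracy(reference, predicted, 10.0)
acc_20ms = boundary_accuracy(reference, predicted, 20.0)
```

With these made-up values, two of four boundaries fall within 10 ms and three of four within 20 ms, which mirrors how a wider tolerance raises the reported rate.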

Performance Comparison of Feature Parameters and Classifiers for Speech/Music Discrimination (음성/음악 판별을 위한 특징 파라미터와 분류기의 성능비교)

  • Kim Hyung Soon;Kim Su Mi
    • MALSORI / no.46 / pp.37-50 / 2003
  • In this paper, we evaluate and compare the performance of speech/music discrimination based on various feature parameters and classifiers. As feature parameters, we consider High Zero Crossing Rate Ratio (HZCRR), Low Short Time Energy Ratio (LSTER), Spectral Flux (SF), Line Spectral Pair (LSP) distance, entropy, and dynamism. We also examine three classifiers: k Nearest Neighbor (k-NN), Gaussian Mixture Model (GMM), and Hidden Markov Model (HMM). According to our experiments, LSP distance and the phoneme-recognizer-based feature set (entropy and dynamism) perform well, while performance differences between classifiers are not significant. When all six feature parameters are employed, an average speech/music discrimination accuracy of up to 96.6% is achieved.

Development of Embedded Fast/Light Phoneme Recognizer for Distributed Speech Recognition (분산음성인식을 위한 내장형 고속/경량 음소인식기 개발)

  • Kim, Seung-Hi;Hwang, Kyu-Woong;Jeon, Hyun-Bae;Jeong, Hoon;Park, Jun
    • Proceedings of the Korea Information Processing Society Conference / 2007.05a / pp.395-396 / 2007
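  • The ETRI Speech/Language Information Research Center is developing a fast phoneme recognizer with a small memory footprint for distributed speech recognition. Fixed information such as the acoustic model, language model, and search network is built in binary form before the recognizer runs and stored in ROM, which greatly reduces the RAM actually required. In the tied-state triphone model, using only unique HMMs greatly reduces recognition time and memory usage. The monophone recognizer used 179 KB of RAM, and the triphone recognizer used 435 KB of RAM with an RTF (Real Time Factor) of 0.02.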
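
As a reading aid for the RTF figure above: the real-time factor is conventionally the processing time divided by the duration of the audio processed, so values below 1 mean faster than real time. The helper below is a minimal sketch of that ratio; the function name and the timings are hypothetical and are not measurements from the paper.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


# Hypothetical timing: 10 s of speech decoded in 0.2 s.
rtf = real_time_factor(0.2, 10.0)
```

Under this definition an RTF of 0.02 corresponds to about 20 ms of processing per second of speech.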

Phoneme-Model Word Recognizer on RASTA-PLP (RASTA-PLP의 음소 모델 단어 인식기 적용)

  • 허창원
    • Proceedings of the Acoustical Society of Korea Conference / 1997.06a / pp.9-12 / 1997
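  • Most speech parameter estimation techniques are easily affected by the frequency response of the communication channel. In this paper, we extract parameters with RASTA-PLP, a technique that is more robust to such steady-state spectral factors in speech, use them as input to a continuous HMM recognizer, and search for the best model while training context-independent phoneme models. The goal is to find the number of re-estimation iterations and the number of mixtures that give the best performance when RASTA-PLP is applied to the ETRI 445 DB. The context-independent phoneme models are defined on the basis of Korean phonetics, with silence added, for a total of 40 models. They are trained as typical left-to-right CHMMs (Continuous Hidden Markov Models) with three states. Viterbi beam search is applied to reduce training time.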
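
The three-state left-to-right topology mentioned above can be illustrated with a small transition matrix in which each state may only repeat or advance to the next state. This is a sketch under assumptions: the abstract does not say whether skip transitions are allowed or how the transitions are initialised, so the no-skip structure, the extra exit column, and the self-loop value of 0.5 are all hypothetical.

```python
def left_to_right_transitions(num_states=3, self_loop=0.5):
    """Build a no-skip left-to-right transition matrix.

    Row i holds P(next | state i). The extra last column is the
    probability of leaving the model from the final state.
    """
    if not 0.0 < self_loop < 1.0:
        raise ValueError("self_loop must be strictly between 0 and 1")
    matrix = []
    for i in range(num_states):
        row = [0.0] * (num_states + 1)
        row[i] = self_loop            # stay in the same state
        row[i + 1] = 1.0 - self_loop  # advance (or exit from the last state)
        matrix.append(row)
    return matrix


transitions = left_to_right_transitions()
```

Every entry below the diagonal is zero, which is what makes the model left-to-right: once a state has been left it cannot be re-entered.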

A Korean Flight Reservation System Using Continuous Speech Recognition

  • Choi, Jong-Ryong;Kim, Bum-Koog;Chung, Hyun-Yeol;Nakagawa, Seiichi
    • The Journal of the Acoustical Society of Korea / v.15 no.3E / pp.60-65 / 1996
  • This paper describes a Korean continuous speech recognition system for flight reservation. It adopts a frame-synchronous One-Pass DP search algorithm driven by the syntactic constraints of a context-free grammar (CFG). For recognition, 48 phoneme-like units (PLUs) were defined and used as the basic units for acoustic modeling of Korean. The modeling used an HMM technique in which each model has four states, three continuous output probability distributions, and three discrete duration distributions. Language modeling by CFG was also applied to the flight reservation task domain, which consisted of 346 words and 422 rewriting rules. In the tests, a sentence recognition rate of 62.6% was obtained after speaker adaptation.
