• Title/Summary/Keyword: speech rates

271 search results

The Study on Korean Prosody Generation using Artificial Neural Networks (인공 신경망의 한국어 운율 발생에 관한 연구)

  • Min Kyung-Joong;Lim Un-Cheon
    • Proceedings of the Acoustical Society of Korea Conference / spring / pp.337-340 / 2004
  • Accurately reproduced prosody is one of the key factors that affect the naturalness of speech synthesized by a TTS system. In general, prosody rules have been gathered either from linguistic knowledge or by analyzing the prosodic information of natural speech, but such rules cannot be complete and some of them may be incorrect. We therefore propose artificial neural networks (ANNs) that can be trained to learn the prosody of natural speech and to generate it. In the learning phase, the ANNs learn the pitch and energy contour of the center phoneme: a string of phonemes from a sentence is applied to the ANNs, the output pattern is compared with the target pattern, and the weights are adjusted to minimize the mean square error between them. In the test phase, the estimation rates were computed. The results show that ANNs can generate the prosody of a sentence.

  • PDF
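The weight-adjustment loop the abstract describes — compare the network output with a target and nudge the weights toward the least mean square error — can be sketched with a single linear unit. This is an illustrative toy, not the paper's network; the feature vectors and pitch targets below are made up.

```python
def lms_train(samples, n_in, lr=0.1, epochs=500):
    """Train one linear unit with the LMS rule.
    samples: list of (feature_vector, target) pairs."""
    w = [0.0] * n_in
    for _ in range(epochs):
        for x, t in samples:
            y = sum(wi * xi for wi, xi in zip(w, x))  # network output
            err = t - y                               # target minus output
            # weight adjustment that reduces the squared error
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
    return w

# toy data: a pitch target that is a fixed linear function of the features
data = [([1.0, 0.5], 1.5), ([0.5, 1.0], 1.2), ([1.0, 1.0], 1.8)]
weights = lms_train(data, n_in=2)
```

After training, the unit reproduces the target for the training patterns, which is the convergence criterion the abstract relies on.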

Feature Extraction by Optimizing the Cepstral Resolution of Frequency Sub-bands (주파수 부대역의 켑스트럼 해상도 최적화에 의한 특징추출)

  • 지상문;조훈영;오영환
    • The Journal of the Acoustical Society of Korea / v.22 no.1 / pp.35-41 / 2003
  • Feature vectors for conventional speech recognition are usually extracted over the full frequency band, so each sub-band contributes equally to the final recognition result. In this paper, feature vectors are extracted independently in each sub-band, and the cepstral resolution of each sub-band feature is controlled for optimal speech recognition. For this purpose, a different number of cepstral coefficients is extracted for each sub-band, following the multi-band approach in which a feature vector is extracted independently per sub-band. Speech recognition rates and clustering quality are suggested as criteria for finding the optimal combination of sub-band feature dimensions. In connected-digit recognition experiments on the TIDIGITS database, the proposed method gave a string accuracy of 99.125%, percent correct of 99.775%, and percent accuracy of 99.705%, corresponding to relative error-rate reductions of 38%, 32%, and 37% over the baseline full-band feature vector.
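The core idea — keep a different number of cepstral coefficients per sub-band — can be sketched as a type-II DCT applied band by band. The band boundaries and per-band orders below are illustrative, not the combination the paper optimizes.

```python
import math

def subband_cepstra(log_spec, bands, dims):
    """Concatenate per-band cepstral vectors of differing order.
    log_spec: log filter-bank energies; bands: (lo, hi) index ranges;
    dims: number of cepstral coefficients kept for each band."""
    feats = []
    for (lo, hi), k in zip(bands, dims):
        band = log_spec[lo:hi]
        n = len(band)
        for i in range(k):  # type-II DCT, first k coefficients only
            feats.append(sum(band[j] * math.cos(math.pi * i * (j + 0.5) / n)
                             for j in range(n)))
    return feats

spec = [math.log(1.0 + j) for j in range(16)]
# keep 3 coefficients for the low band, 2 for the high band
feat = subband_cepstra(spec, [(0, 8), (8, 16)], dims=[3, 2])
```

Tuning `dims` per band is exactly the degree of freedom the paper searches over with its recognition-rate and clustering criteria.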

Improved Harmonic-CELP Speech Coder with Dual Bit-Rates(2.4/4.0 kbps) (이중 전송률(2.4/4.0 kbps)을 갖는 개선된 하모닉-CELP 음성부호화기)

  • 김경민;윤성완;최용수;박영철;윤대희;강태익
    • The Journal of Korean Institute of Communications and Information Sciences / v.28 no.3C / pp.239-247 / 2003
  • This paper presents a dual-rate (2.4/4.0 kbps) Improved Harmonic-CELP (IHC) speech coder based on the Efficient Harmonic-CELP (EHC) previously presented by the authors. The proposed IHC employs harmonic coding for voiced segments and CELP for unvoiced segments. In the IHC, an initial voiced/unvoiced estimate is obtained from the pitch gain and energy; the final V/UV mode is then decided using the frame energy contour. A new harmonic estimation combining peak picking and delta adjustment is more reliable than that of the EHC. In addition, a noise mixing scheme combined with an improved band voicing measurement provides naturalness in the synthesized speech. To demonstrate its performance, the IHC coder was implemented and compared with the 2.0/4.0 kbps HVXC (Harmonic Vector eXcitation Coding) standardized by MPEG-4. Subjective evaluation showed that the proposed IHC coder can produce better speech quality than the HVXC at only 40% of the HVXC's complexity.
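The two-stage mode decision — an initial V/UV estimate from pitch gain and energy, then a refinement along the frame sequence — can be sketched as follows. The thresholds and the simple majority-of-neighbours refinement are made-up stand-ins for the paper's energy-contour rule.

```python
def init_vuv(pitch_gain, energy, gain_thr=0.4, energy_thr=0.01):
    """Initial voiced/unvoiced estimate from pitch gain and frame energy
    (thresholds are illustrative, not the paper's)."""
    return "V" if pitch_gain > gain_thr and energy > energy_thr else "UV"

def refine_vuv(modes):
    """Refine the initial decision along the frame sequence: a single
    frame whose two neighbours agree with each other is flipped to match
    them, a crude stand-in for the energy-contour check."""
    out = list(modes)
    for i in range(1, len(modes) - 1):
        if modes[i - 1] == modes[i + 1] != modes[i]:
            out[i] = modes[i - 1]
    return out

# (pitch_gain, energy) per frame: one spurious UV inside a voiced run
frames = [(0.9, 0.2), (0.1, 0.2), (0.8, 0.3), (0.7, 0.25)]
modes = refine_vuv([init_vuv(g, e) for g, e in frames])
```

The refinement stage is what keeps isolated mode errors from fragmenting a voiced region into alternating harmonic and CELP frames.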

Phonetic Transcription based Speech Recognition using Stochastic Matching Method (확률적 매칭 방법을 사용한 음소열 기반 음성 인식)

  • Kim, Weon-Goo
    • Journal of the Korean Institute of Intelligent Systems / v.17 no.5 / pp.696-700 / 2007
  • A new method is presented that improves the performance of a phonetic-transcription-based speech recognition system with a speaker-independent (SI) phonetic recognizer. Since an SI phoneme-HMM-based speech recognition system stores only the phonetic transcription of the input sentence, the storage space can be reduced greatly. However, its performance is worse than that of a speaker-dependent system because of the phoneme recognition errors introduced by using SI models. A new training method that iteratively estimates the phonetic transcription and the transformation vectors is presented, which reduces the mismatch between the training utterances and the set of SI models using speaker adaptation techniques. For speaker adaptation, stochastic matching methods are used to estimate the transformation vectors. Experiments performed over actual telephone lines show that an error-rate reduction of about 45% can be achieved compared with the conventional method.
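The iterative estimation the abstract outlines — align the utterance to the SI models, estimate a transformation that reduces the mismatch, and repeat — can be sketched in a heavily simplified form with a single additive bias vector and nearest-mean alignment. This is a toy illustration of the alternation, not the paper's ML transformation estimate.

```python
def estimate_bias(frames, means, iters=5):
    """Alternately align each frame to its nearest model mean and
    re-estimate one additive bias vector, a toy version of the iterative
    estimation used in stochastic matching."""
    dim = len(frames[0])
    bias = [0.0] * dim
    for _ in range(iters):
        resid = []
        for f in frames:
            x = [f[d] - bias[d] for d in range(dim)]   # compensated frame
            mu = min(means, key=lambda m: sum((a - b) ** 2
                                              for a, b in zip(x, m)))
            resid.append([a - b for a, b in zip(x, mu)])
        # move the bias toward the average remaining mismatch
        bias = [bias[d] + sum(r[d] for r in resid) / len(resid)
                for d in range(dim)]
    return bias

means = [[0.0, 0.0], [3.0, 3.0]]          # toy SI model means
frames = [[0.5, -0.2], [3.5, 2.8]]        # same means shifted by a channel bias
bias = estimate_bias(frames, means)
```

With a consistent shift, the loop recovers the channel bias, after which the compensated frames match the SI models.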

An Improvement of Stochastic Feature Extraction for Robust Speech Recognition (강인한 음성인식을 위한 통계적 특징벡터 추출방법의 개선)

  • 김회린;고진석
    • The Journal of the Acoustical Society of Korea / v.23 no.2 / pp.180-186 / 2004
  • The presence of noise in speech signals degrades the performance of recognition systems in which the training and test environments are mismatched, so a robust speech recognizer must compensate for these mismatches. In this paper, we study an improvement of stochastic feature extraction based on band SNR for robust speech recognition. First, we propose a modified version of multi-band spectral subtraction (MSS) that adjusts the subtraction level of the noise spectrum according to the band SNR. In the proposed method, referred to as M-MSS, a noise normalization factor is newly introduced to finely control the over-estimation factor depending on the band SNR. We also modify the architecture of the stochastic feature extraction (SFE) method: better performance is obtained when spectral subtraction is applied in the power spectrum domain rather than in the mel-scale domain. This method is denoted M-SFE. Lastly, we apply M-MSS to the modified stochastic feature extraction structure, which is denoted the MMSS-MSFE method. The proposed methods were evaluated on isolated word recognition under various noise environments. Relative to ordinary spectral subtraction (SS), the average error rates of the M-MSS, M-SFE, and MMSS-MSFE methods were reduced by 18.6%, 15.1%, and 33.9%, respectively. From these results, we conclude that the proposed methods are good candidates for robust feature extraction in noisy speech recognition.
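Band-SNR-dependent subtraction of the kind M-MSS performs can be sketched as below: each band estimates its own SNR and scales the amount of noise it subtracts accordingly. The particular over-subtraction schedule and spectral floor here are illustrative choices, not the paper's normalization factor.

```python
import math

def multiband_subtract(power, noise, bands, floor=0.01):
    """Per-band spectral subtraction with an SNR-dependent
    over-subtraction factor, in the spirit of M-MSS."""
    out = list(power)
    for lo, hi in bands:
        p, n = sum(power[lo:hi]), sum(noise[lo:hi]) + 1e-12
        snr_db = 10.0 * math.log10(max(p / n, 1e-12))
        # subtract more aggressively in low-SNR bands (illustrative schedule)
        alpha = max(1.0, 4.0 - 0.15 * snr_db)
        for k in range(lo, hi):
            # spectral floor prevents negative power after subtraction
            out[k] = max(power[k] - alpha * noise[k], floor * power[k])
    return out

# high-SNR low band, noise-dominated high band
power = [1.0, 1.0, 1.0, 1.0, 0.05, 0.05, 0.05, 0.05]
noise = [0.2] * 8
clean = multiband_subtract(power, noise, [(0, 4), (4, 8)])
```

The noise-dominated band is driven down to the spectral floor while the high-SNR band keeps most of its energy, which is the per-band behaviour the method is after.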

Effects of Lecturer Appearance and Speech Rate on Learning Flow and Teaching Presence in Video Learning (동영상 학습에서 교수자 출연여부와 발화속도가 학습몰입과 교수실재감에 미치는 효과)

  • Tai, Xiao-Xia;Zhu, Hui-Qin;Kim, Bo-Kyeong
    • Journal of the Korea Academia-Industrial cooperation Society / v.22 no.1 / pp.267-274 / 2021
  • The purpose of this study is to investigate differences in learning flow and teaching presence according to the lecturer's on-screen appearance and speech rate. For the experiment, 183 freshman students from Xingtai University in China were selected as subjects, and four types of lecture videos were developed, varying the lecturer's appearance and speech rate. Data were analyzed through multivariate analysis of variance. First, the learning flow and teaching presence of groups who watched videos in which the lecturer appeared were significantly higher than those of groups who learned without the lecturer on screen. Second, groups who learned from videos with a fast speech rate showed higher learning flow and teaching presence than groups who learned at a slow speech rate. Third, there was no significant interaction between the lecturer's appearance and speech rate for either learning flow or teaching presence. These results provide a theoretical and practical basis for developing customized videos according to learners' characteristics.

The Analysis of Quantified Characteristics of Actor Gangho Song's Speech Disfluency rates in Thirst(2009) (박쥐(2009) 배우 송강호 대사에서 비유창성 비율의 정량적 특징 분석)

  • Nam, Seung Suk
    • Proceedings of the Korea Contents Association Conference / 2011.05a / pp.311-312 / 2011
  • The purpose of this study is to quantitatively analyze, through transcription, how changes in the disfluency rate of an actor's lines correlate with the narrative flow of a film. Conversational speech exhibits disfluencies (DFs) such as noises and filled pauses, and an actor deliberately raises the disfluency rate in order to deliver lines naturally. The subject of this study is the lines of actor Gangho Song (in the role of Sang-hyun) in Thirst (2009). The lines of Sang-hyun, a Catholic priest who becomes a vampire, were transcribed together with their disfluency features, and the changing pattern of the disfluency rate in the actor's lines was compared and analyzed according to the protagonist's identity.

  • PDF

Improving The Excitation Signal for Low-rate CELP Speech Coding (저전송속도 CELP 부호화기에서 여기신호의 개선)

  • 권철홍
    • Proceedings of the Acoustical Society of Korea Conference / 1998.08a / pp.136-141 / 1998
  • To enhance the performance of a CELP coder at low bit rates, it is necessary to give the CELP excitation a peaky pulse characteristic. In this paper we introduce an excitation signal with such a characteristic, obtained by using a two-tap pitch predictor: the predictor assigns different gains to samples of the signal according to their amplitudes. In voiced sounds the signal has the desirable peaky pulse characteristic and its periodicity is well reproduced. In particular, peaky pulses at voiced onsets and the bursts of plosive sounds are clearly reconstructed.

  • PDF
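A two-tap long-term (pitch) predictor of the kind the abstract mentions feeds the excitation back on itself at the pitch lag, so energy recurs at the pitch period and pulses stay peaky. The tap values below are illustrative, not the coder's quantized coefficients.

```python
def peaky_excitation(residual, lag, b0=0.6, b1=0.3):
    """Pass a residual through a two-tap pitch predictor so that
    energy recurs at the pitch period, giving a peaky, periodic
    excitation (tap values b0/b1 are illustrative)."""
    out = list(residual)
    for n in range(len(residual)):
        acc = residual[n]
        if n - lag >= 0:
            acc += b0 * out[n - lag]       # tap one pitch period back
        if n - lag - 1 >= 0:
            acc += b1 * out[n - lag - 1]   # neighbouring second tap
        out[n] = acc
    return out

# a single onset pulse turns into a decaying pitch-periodic pulse train
pulse = [1.0] + [0.0] * 11
exc = peaky_excitation(pulse, lag=4)
```

Starting from a lone onset pulse, the output repeats it every `lag` samples with decaying amplitude, which is the periodicity-preserving behaviour described above.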

Double Compensation Framework Based on GMM For Speaker Recognition (화자 인식을 위한 GMM기반의 이중 보상 구조)

  • Kim Yu-Jin;Chung Jae-Ho
    • MALSORI / no.45 / pp.93-105 / 2003
  • In this paper, we present a single framework based on GMMs for speaker recognition. The proposed framework can simultaneously minimize environmental variations under mismatched conditions and adapt the bias-free, speaker-dependent characteristics of claimant utterances to the background GMM to create a speaker model. We compare closed-set speaker identification for the conventional method and the proposed method on both TIMIT and NTIMIT. Across several sets of experiments, the proposed method improves recognition rates on a simulated channel and a telephone channel by 7.2% and 27.4%, respectively.

  • PDF
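The closed-set identification task the abstract evaluates reduces to scoring the test frames against each speaker's GMM and picking the best log-likelihood. The sketch below shows that scoring step only, with diagonal covariances and toy one-component models; the paper's compensation stages are omitted.

```python
import math

def gmm_loglik(frames, gmm):
    """Total log-likelihood of frames under a diagonal-covariance GMM.
    gmm: list of (weight, mean, var) components."""
    total = 0.0
    for x in frames:
        comps = []
        for w, mu, var in gmm:
            lp = math.log(w)
            for xi, mi, vi in zip(x, mu, var):
                lp += -0.5 * (math.log(2 * math.pi * vi)
                              + (xi - mi) ** 2 / vi)
            comps.append(lp)
        m = max(comps)  # log-sum-exp over mixture components
        total += m + math.log(sum(math.exp(c - m) for c in comps))
    return total

def identify(frames, models):
    """Closed-set ID: pick the speaker whose GMM scores highest."""
    return max(models, key=lambda spk: gmm_loglik(frames, models[spk]))

models = {
    "A": [(1.0, [0.0, 0.0], [1.0, 1.0])],
    "B": [(1.0, [4.0, 4.0], [1.0, 1.0])],
}
test_frames = [[0.2, -0.1], [0.1, 0.3]]
```

The proposed framework's contribution sits before this step: compensating the frames and adapting the background model so that the likelihood comparison is fair under channel mismatch.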

On a Reduction of Codebook Searching Time by using RPE Searching Technique in the CELP Vocoder (RPE 검색을 이용한 CELP 보코더의 불규칙 코드북 검색)

  • 김대식
    • Proceedings of the Acoustical Society of Korea Conference / 1995.06a / pp.141-145 / 1995
Code-excited linear prediction (CELP) speech coders exhibit good performance at data rates as low as 4800 bps, but the major drawback of CELP-type coders is their large computational requirement. In this paper, we propose a new codebook search method that preserves the quality of the CELP vocoder at reduced complexity. The basic idea is to restrict the search range of the random codebook by using a regular pulse excitation (RPE) search technique. Applying the proposed method to the CELP vocoder, we obtain approximately a 48% reduction in codebook-search complexity.

  • PDF
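Restricting the codebook search range can be sketched as a two-stage search: a cheap pre-selection pass (standing in here for the RPE-based stage, which this sketch does not reproduce) keeps a shortlist of candidates, and the exact squared error is computed only for those. The target and codebook below are made up.

```python
def restricted_search(target, codebook, top_m=4):
    """Two-stage codebook search: shortlist top_m codewords by a cheap
    correlation score, then evaluate the exact error on the shortlist
    only, instead of over the whole codebook."""
    def corr(i):
        return abs(sum(t * c for t, c in zip(target, codebook[i])))
    shortlist = sorted(range(len(codebook)), key=corr, reverse=True)[:top_m]

    def err(i):
        return sum((t - c) ** 2 for t, c in zip(target, codebook[i]))
    return min(shortlist, key=err)

target = [1.0, 0.0, 1.0, 0.0]
codebook = [[0.0, 1.0, 0.0, 1.0],
            [1.0, 0.0, 0.9, 0.1],
            [0.5, 0.5, 0.5, 0.5],
            [-1.0, 0.0, -1.0, 0.0],
            [1.0, 1.0, 1.0, 1.0]]
best = restricted_search(target, codebook, top_m=3)
```

The complexity saving comes from evaluating the expensive error criterion on `top_m` entries rather than the full random codebook, at the risk of the pre-selection discarding the true optimum.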