• Title/Summary/Keyword: Speech Synthesis

Spectrum Based Excitation Extraction for HMM Based Speech Synthesis System (스펙트럼 기반 여기신호 추출을 통한 HMM기반 음성합성기의 음질 개선 방법)

  • Lee, Bong-Jin;Kim, Seong-Woo;Baek, Soon-Ho;Kim, Jong-Jin;Kang, Hong-Goo
    • The Journal of the Acoustical Society of Korea, v.29 no.1, pp.82-90, 2010
  • This paper proposes an efficient method for enhancing the quality of synthesized speech in an HMM-based speech synthesis system. In training, the proposed method models spectral parameters and excitation signals with a Gaussian mixture model; at the synthesis stage, it estimates appropriate excitation signals from the spectral parameters. Both WB-PESQ and MUSHRA results show that the proposed method provides better speech quality than a conventional HMM-based speech synthesis system.
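The abstract does not say how the excitation is estimated from the spectral parameters. One common way to map one feature to another with a Gaussian mixture model is the conditional mean under a joint mixture, sketched below for scalar features. The component field names (`weight`, `mean_x`, `mean_y`, `var_x`, `cov_xy`) and the scalar simplification are assumptions made for illustration; real spectral and excitation features would be vectors.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gmm_conditional_mean(x, components):
    """Return E[y | x] under a joint Gaussian mixture over (x, y).

    Each component is a dict with the hypothetical keys
    weight, mean_x, mean_y, var_x and cov_xy.
    """
    # Unnormalized responsibility of each component for the observed x.
    likelihoods = [
        c["weight"] * gaussian_pdf(x, c["mean_x"], c["var_x"]) for c in components
    ]
    total = sum(likelihoods)
    estimate = 0.0
    for c, lik in zip(components, likelihoods):
        # Per-component linear regression from x to y.
        local = c["mean_y"] + c["cov_xy"] / c["var_x"] * (x - c["mean_x"])
        estimate += (lik / total) * local
    return estimate
```

With a single component the estimate reduces to ordinary linear regression; with several, each component's regression line is blended according to how well that component explains the observed input.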

A Study on a Searching, Extraction and Approximation-Synthesis of Transition Segment in Continuous Speech (연속음성에서 천이구간의 탐색, 추출, 근사합성에 관한 연구)

  • Lee, Si-U
    • The Transactions of the Korea Information Processing Society, v.7 no.4, pp.1299-1304, 2000
  • In a speech coding system that uses voiced and unvoiced excitation sources, speech quality is distorted when voiced and unvoiced consonants coexist in a single frame. I therefore propose a method for searching, extracting, and approximately synthesizing the TSIUVC (Transition Segment Including UnVoiced Consonant) so that voiced and unvoiced consonants do not coexist in a frame. The method is based on the zero-crossing rate and a pitch detector that uses a FIR-STREAK digital filter. The TSIUVC extraction rates are 84.8% (plosive), 94.9% (fricative), and 92.3% (affricate) for female voices, and 88% (plosive), 94.9% (fricative), and 92.3% (affricate) for male voices. In addition, high-quality approximation-synthesis waveforms are obtained within the TSIUVC by using frequency information below 0.547 kHz and above 2.813 kHz. The method can be applied to low-bit-rate speech coding, speech analysis, and speech synthesis.

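The zero-crossing rate mentioned in the abstract above can be sketched in a few lines. This is a minimal illustration, not the paper's detector: the FIR-STREAK pitch detector is not reproduced, and the function names and the `threshold` value are assumptions chosen for the example.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for prev, cur in zip(frame, frame[1:]) if (prev >= 0) != (cur >= 0)
    )
    return crossings / (len(frame) - 1)


def looks_unvoiced(frame, threshold=0.3):
    """Crude voiced/unvoiced guess; the threshold is an illustrative assumption."""
    return zero_crossing_rate(frame) > threshold
```

Noise-like unvoiced consonants tend to change sign far more often than voiced speech, which is why a high rate is a useful hint that a frame contains an unvoiced segment.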

Speech Transition Detection and approximate-synthesis Method for Speech Signal Compression and Recovery (음성신호 압축 및 복원을 위한 음성 천이구간 검출과 근사합성 방식)

  • Lee, Kwang-Seok;Kim, Bong-Gi;Kang, Seong-Soo;Kim, Hyun-Deok
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference, 2008.05a, pp.763-767, 2008
  • In a speech coding system that uses voiced and unvoiced excitation sources, speech quality is distorted when voiced and unvoiced consonants coexist in a single frame. We therefore proposed a method for searching and extracting the TS (Transition Segment), which includes the unvoiced consonant, so that voiced and unvoiced consonants do not coexist in a frame. This research presents a new TS approximate-synthesis method that uses Least Mean Square and frequency band division. As a result, the method obtains high-quality approximation-synthesis waveforms within the TS by using frequency information below 0.547 kHz and above 2.813 kHz. Importantly, even the maximum error signal yields a low-distortion approximation-synthesis waveform within the TS. The method can be applied to a new Voiced/Silence/TS speech coding scheme, speech analysis, and speech synthesis.

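The frequency band division described above keeps only the content below 0.547 kHz and above 2.813 kHz. A minimal sketch of that band selection with a naive discrete Fourier transform follows. The function name, the `sample_rate` parameter, and the DFT-based approach are assumptions for illustration; the abstract does not state the sampling rate or how the bands are separated, and the Least Mean Square step is not shown.

```python
import cmath
import math


def keep_outer_bands(signal, sample_rate, low_hz=547.0, high_hz=2813.0):
    """Zero DFT bins strictly between low_hz and high_hz, then invert."""
    n = len(signal)
    # Naive O(n^2) DFT, adequate for one short illustrative frame.
    spectrum = [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    for k in range(n):
        # Bin k and its mirror n - k share the same physical frequency.
        freq = min(k, n - k) * sample_rate / n
        if low_hz < freq < high_hz:
            spectrum[k] = 0.0
    # Inverse DFT; the output stays real because bins are zeroed symmetrically.
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]
```

For example, with a 32-sample frame at an assumed 8 kHz sampling rate, a 250 Hz tone passes through unchanged while a 1 kHz tone is removed.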

Speech Signal Compression and Recovery Using Transition Detection and Approximate-Synthesis (천이구간 추출 및 근사합성에 의한 음성신호 압축과 복원)

  • Lee, Kwang-Seok;Lee, Byeong-Ro
    • Journal of the Korea Institute of Information and Communication Engineering, v.13 no.2, pp.413-418, 2009
  • In a speech coding system that uses voiced and unvoiced excitation sources, speech quality is distorted when voiced and unvoiced consonants coexist in a single frame. We therefore proposed a method for searching and extracting the TS (Transition Segment), which includes the unvoiced consonant, so that voiced and unvoiced consonants do not coexist in a frame. This research presents a new TS approximate-synthesis method that uses Least Mean Square and frequency band division. As a result, the method obtains high-quality approximation-synthesis waveforms within the TS by using frequency information below 0.547 kHz and above 2.813 kHz. Importantly, even the maximum error signal yields a low-distortion approximation-synthesis waveform within the TS. The method can be applied to a new Voiced/Silence/TS speech coding scheme, speech analysis, and speech synthesis.

A Study on Approximation-Synthesis of Transition Segment in Speech Signal (음성신호에서 천이구간의 근사합성에 관한 연구)

  • Lee See-Woo
    • The Journal of the Korea Contents Association, v.5 no.3, pp.167-173, 2005
  • In a speech coding system that uses voiced and unvoiced excitation sources, speech quality is distorted when voiced and unvoiced consonants coexist in a single frame. I therefore propose a TSIUVC (Transition Segment Including Unvoiced Consonant) extraction method that uses pitch pulses and the zero-crossing rate so that voiced and unvoiced consonants do not coexist in a frame. This paper also presents a TSIUVC approximate-synthesis method that uses frequency band division. As a result, the method obtains a high-quality approximation-synthesis waveform within the TSIUVC by using frequency information below 0.547 kHz and above 2.813 kHz. The TSIUVC extraction rate was 91% for female voices and 96.2% for male voices. The method can be applied to a new Voiced/Silence/TSIUVC speech coding scheme, speech analysis, and speech synthesis.


A Study on Multi-Pulse Speech Coding Method by using Selected Information in a Frequency Domain (주파수 영역의 선택정보를 이용한 멀티펄스 음성부호화 방식에 관한 연구)

  • Lee See-Woo
    • Journal of Internet Computing and Services, v.7 no.4, pp.57-66, 2006
  • In this paper, I propose a new multi-pulse speech coding method (FBD-MPC: Frequency Band Division MPC) that uses TSIUVC (Transition Segment Including UnVoiced Consonant) searching, extraction, and approximation-synthesis in the frequency domain. The TSIUVC extraction rates are 84.8% (plosive), 94.9% (fricative), and 92.3% (affricate) for female voices, and 88% (plosive), 94.9% (fricative), and 92.3% (affricate) for male voices. In addition, high-quality approximation-synthesis waveforms are obtained within the TSIUVC by using frequency information below 0.547 kHz and above 2.813 kHz. I evaluate MPC, which uses voiced/unvoiced switching information, against FBD-MPC, which uses Voiced/Silence/TSIUVC switching information. The results show that speech synthesized by FBD-MPC has better quality than speech synthesized by MPC.


Real-time implementation and performance evaluation of speech classifiers in speech analysis-synthesis

  • Kumar, Sandeep
    • ETRI Journal, v.43 no.1, pp.82-94, 2021
  • In this work, six voiced/unvoiced speech classifiers based on the autocorrelation function (ACF), average magnitude difference function (AMDF), cepstrum, weighted ACF (WACF), zero crossing rate and energy of the signal (ZCR-E), and neural networks (NNs) have been simulated and implemented in real time using the TMS320C6713 DSP starter kit. These speech classifiers have been integrated into a linear-predictive-coding-based speech analysis-synthesis system and their performance has been compared in terms of the percentage of the voiced/unvoiced classification accuracy, speech quality, and computation time. The results of the percentage of the voiced/unvoiced classification accuracy and speech quality show that the NN-based speech classifier performs better than the ACF-, AMDF-, cepstrum-, WACF- and ZCR-E-based speech classifiers for both clean and noisy environments. The computation time results show that the AMDF-based speech classifier is computationally simple, and thus its computation time is less than that of other speech classifiers, while that of the NN-based speech classifier is greater compared with other classifiers.
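Of the classifiers listed above, the AMDF is the simplest to write down, which is consistent with the low computation time the abstract reports for it. The following is a minimal sketch of the function and of picking the lag with the deepest valley; the function names and the lag search range are assumptions, and the paper's real-time DSP implementation and its voiced/unvoiced decision rule are not reproduced.

```python
def amdf(frame, lag):
    """Average magnitude difference between the frame and a lagged copy."""
    pairs = len(frame) - lag
    return sum(abs(frame[n] - frame[n + lag]) for n in range(pairs)) / pairs


def estimate_period(frame, min_lag, max_lag):
    """Lag in [min_lag, max_lag] with the smallest AMDF value."""
    return min(range(min_lag, max_lag + 1), key=lambda lag: amdf(frame, lag))
```

A periodic (voiced) frame produces a deep AMDF valley at its pitch period, while a noise-like (unvoiced) frame produces no pronounced valley. The computation needs only subtractions and absolute values, with no multiplications.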

A Spectral Smoothing Algorithm for Unit Concatenating Speech Synthesis (코퍼스 기반 음성합성기를 위한 합성단위 경계 스펙트럼 평탄화 알고리즘)

  • Kim Sang-Jin;Jang Kyung Ae;Hahn Minsoo
    • MALSORI, no.56, pp.225-235, 2005
  • Speech unit concatenation with a large database is presently the most popular method for speech synthesis. In this approach, mismatches at the unit boundaries are unavoidable and are one of the causes of quality degradation. This paper proposes an algorithm to reduce undesired discontinuities between successive units. Optimal matching points are calculated in two steps. First, the Kullback-Leibler distance is used for spectral matching; then unit sliding and overlap windowing are used for waveform matching. The proposed algorithm is implemented for a corpus-based unit-concatenation Korean text-to-speech system that has an automatically labeled database. Experimental results show that the algorithm performs better than raw concatenation or the overlap smoothing method.

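The spectral matching step above relies on the Kullback-Leibler distance. A minimal sketch of a symmetric version between two magnitude spectra is given below. Treating each spectrum as a normalized distribution, using the symmetric form, and the `floor` value are all assumptions for illustration; the abstract does not specify which variant the authors use.

```python
import math


def normalize(spectrum, floor=1e-12):
    """Scale a non-negative magnitude spectrum so that it sums to one."""
    floored = [max(value, floor) for value in spectrum]
    total = sum(floored)
    return [value / total for value in floored]


def symmetric_kl(spectrum_a, spectrum_b):
    """Symmetric Kullback-Leibler distance between two normalized spectra."""
    p = normalize(spectrum_a)
    q = normalize(spectrum_b)
    # KL(p || q) + KL(q || p) simplifies to the sum of (p - q) * log(p / q).
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

A candidate join point could then be chosen as the pair of boundary frames with the smallest distance, before any waveform-level alignment is applied.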

Analysis of the Timing of Spoken Korean Using a Classification and Regression Tree (CART) Model

  • Chung, Hyun-Song;Huckvale, Mark
    • Speech Sciences, v.8 no.1, pp.77-91, 2001
  • This paper investigates the timing of Korean spoken in a news-reading speech style in order to improve the naturalness of durations used in Korean speech synthesis. Each segment in a corpus of 671 read sentences was annotated with 69 segmental and prosodic features so that the measured duration could be correlated with the context in which it occurred. A CART model based on the features showed a correlation coefficient of 0.79 with an RMSE (root mean squared prediction error) of 23 ms between actual and predicted durations in reserved test data. These results are comparable with recent published results in Korean and similar to results found in other languages. An analysis of the classification tree shows that phrasal structure has the greatest effect on the segment duration, followed by syllable structure and the manner features of surrounding segments. The place features of surrounding segments only have small effects. The model has application in Korean speech synthesis systems.

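The two figures of merit quoted above, the correlation coefficient and the RMSE between actual and predicted durations, can be computed as in the sketch below. The function names and the sample durations are illustrative assumptions and are not data from the paper.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def pearson(actual, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


# Made-up segment durations in milliseconds, for illustration only.
actual_ms = [50, 60, 70, 80]
predicted_ms = [53, 57, 73, 77]
print(rmse(actual_ms, predicted_ms), pearson(actual_ms, predicted_ms))
```

The RMSE is expressed in the same unit as the durations (milliseconds here), while the correlation coefficient is unitless, so the two measures are usually reported together.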

Control of Duration Model Parameters in HMM-based Korean Speech Synthesis (HMM 기반의 한국어 음성합성에서 지속시간 모델 파라미터 제어)

  • Kim, Il-Hwan;Bae, Keun-Sung
    • Speech Sciences, v.15 no.4, pp.97-105, 2008
  • HMM-based text-to-speech systems (HTS) have been widely studied because, compared with corpus-based unit-concatenation text-to-speech systems, they need less memory, have lower computational complexity, and are suitable for embedded systems. They also have the advantage that the voice characteristics and speaking rate of the synthetic speech can be changed easily by modifying the HMM parameters appropriately. We implemented an HMM-based Korean text-to-speech system using a small Korean speech DB and propose a method to increase the naturalness of the synthetic speech by controlling the duration model parameters of the system. We performed a paired comparison test to verify that these techniques are effective. The test yielded a preference score of 73.8%, showing that controlling the duration model parameters improves the naturalness of the synthetic speech.

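The abstract does not say which duration parameters are adjusted or how. A commonly described way to control speaking rate in HMM-based synthesis sets each state duration to its Gaussian mean plus a shared factor times its variance, d_k = m_k + rho * var_k, so that states with larger variance absorb more of the change. The sketch below follows that formulation as an assumption; it is not the authors' method, and the function name and arguments are hypothetical.

```python
def state_durations(means, variances, total_frames=None, rho=0.0):
    """Per-state durations d_k = m_k + rho * var_k, at least one frame each.

    If total_frames is given, rho is chosen so that the unrounded
    durations sum to total_frames.
    """
    if total_frames is not None:
        rho = (total_frames - sum(means)) / sum(variances)
    return [max(1, round(m + rho * v)) for m, v in zip(means, variances)]
```

With `rho` equal to zero the mean durations are used unchanged; a positive value slows the speech and a negative value speeds it up.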