• Title/Summary/Keyword: Vocoder

Search Result 151, Processing Time 0.026 seconds

Performance Comparison of State-of-the-Art Vocoder Technology Based on Deep Learning in a Korean TTS System (한국어 TTS 시스템에서 딥러닝 기반 최첨단 보코더 기술 성능 비교)

  • Kwon, Chul Hong
    • The Journal of the Convergence on Culture Technology
    • /
    • v.6 no.2
    • /
    • pp.509-514
    • /
    • 2020
  • The conventional TTS system consists of several modules, including text preprocessing, parsing analysis, grapheme-to-phoneme conversion, boundary analysis, prosody control, acoustic feature generation by acoustic model, and synthesized speech generation. But TTS system with deep learning is composed of Text2Mel process that generates spectrogram from text, and vocoder that synthesizes speech signals from spectrogram. In this paper, for the optimal Korean TTS system construction we apply Tacotron2 to Tex2Mel process, and as a vocoder we introduce the methods such as WaveNet, WaveRNN, and WaveGlow, and implement them to verify and compare their performance. Experimental results show that WaveNet has the highest MOS and the trained model is hundreds of megabytes in size, but the synthesis time is about 50 times the real time. WaveRNN shows MOS performance similar to that of WaveNet and the model size is several tens of megabytes, but this method also cannot be processed in real time. WaveGlow can handle real-time processing, but the model is several GB in size and MOS is the worst of the three vocoders. From the results of this study, the reference criteria for selecting the appropriate method according to the hardware environment in the field of applying the TTS system are presented in this paper.

Improvement of Overlapped Codebook Search in QCELP (QCELP에서 중첩된 코드북 검색의 개선)

  • 박광철;한승진;이정현
    • The KIPS Transactions:PartC
    • /
    • v.8C no.1
    • /
    • pp.105-112
    • /
    • 2001
  • In this paper, we present the advanced QCELP codebook search improving the qualification of speech, which can make QCELP vocoder used in noise robust system. While conventional QCELP usually searches stochastic codebook once, we can find that two times search is the most suitable for improving the quality of speech after we did 2-5 times search. Consequently, the advanced QCELP vocoder represents excitation signal in detail using two times precise quantization and so improve the qualification of speech. In our experiment, we use the speeches collected from circumstance (such as lecture room, house, street, laboratory etc.) without regarding noise as input dat and measure the speech Qualification using SNR, segSNR. As the result of the experiment, we find that the advanced QCELP makes SNR and segSNR improved by 38.35% and 65.51% respectively compared with conventional QCELP.

  • PDF

On A Reduction of Pitch Searching Time by Preprocessing in the CELP Vocoder (CELP 보코더에서 전처리에 의한 피치검색 시간의 단축)

  • Kim, Dae-Sik;Bae, Myeong-Jin;Kim, Jong-Jae;Byun, Kyung-Jin;Han, Ki-Chun;Yoo, Hah-Young
    • The Journal of the Acoustical Society of Korea
    • /
    • v.13 no.3
    • /
    • pp.33-40
    • /
    • 1994
  • Code Excited Linear Prediction(CELP) speech coders exhibit good performance at data rates below 4.8 kbps. This major drawback of CELP type coders is required much computation. In this paper, we propose a new pitch search method that preserves the quality of the CELP vocoder with reducing complexity. In the pitch searching, we detect the segments of high correlation by a simple preprocessing, and then carry out the pitch searching only for the segments obtained by the preprocessing. By using the proposed method, we can get approximately $77\%$ complexity reduction in the pitch search.

  • PDF

Voice transformation for HTS using correlation between fundamental frequency and vocal tract length (기본주파수와 성도길이의 상관관계를 이용한 HTS 음성합성기에서의 목소리 변환)

  • Yoo, Hyogeun;Kim, Younggwan;Suh, Youngjoo;Kim, Hoirin
    • Phonetics and Speech Sciences
    • /
    • v.9 no.1
    • /
    • pp.41-47
    • /
    • 2017
  • The main advantage of the statistical parametric speech synthesis is its flexibility in changing voice characteristics. A personalized text-to-speech(TTS) system can be implemented by combining a speech synthesis system and a voice transformation system, and it is widely used in many application areas. It is known that the fundamental frequency and the spectral envelope of speech signal can be independently modified to convert the voice characteristics. Also it is important to maintain naturalness of the transformed speech. In this paper, a speech synthesis system based on Hidden Markov Model(HMM-based speech synthesis, HTS) using the STRAIGHT vocoder is constructed and voice transformation is conducted by modifying the fundamental frequency and spectral envelope. The fundamental frequency is transformed in a scaling method, and the spectral envelope is transformed through frequency warping method to control the speaker's vocal tract length. In particular, this study proposes a voice transformation method using the correlation between fundamental frequency and vocal tract length. Subjective evaluations were conducted to assess preference and mean opinion scores(MOS) for naturalness of synthetic speech. Experimental results showed that the proposed voice transformation method achieved higher preference than baseline systems while maintaining the naturalness of the speech quality.

Dynamic Excitation Modeling Scheme Applied for Variable Low Bit-Rate Homomorphic Vocoder (가변 저 전송율 호모몰픽 보코더에 응용된 동적 음원 모델링 기법)

  • 정재호
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.19 no.12
    • /
    • pp.2479-2488
    • /
    • 1994
  • In this paper, a new dynamic excitation modeling scheme is proposed. Based upon the proposed excitation modeling scheme, two variable bit rate homomorphic vocoders are designed, whose average bit rates are 3.8 Kbps and 4.4 Kbps. The performance of the proposed excitation modeling scheme is then evaluated through the subjective listening tests. In the tests, the performances of two speech coders designed in this paper ate compared with the one of 4.8 Kbps homomorphic vocoder designed by Chung and Schafer, in which conventional static excitation modeling scheme applied. The subjective listening tests show that proposed dynamic excitation modeling scheme improves synthesized speech quality while lowering the average bit rate of speech coders.

  • PDF

A Study of Real-Time Implementation of Audio/Data Processor for Digital/Analog Dual mode Mobile Phone (디지탈/아날로그 겸용 이동통신 단말기를 위한 오디오/데이타 프로세서의 실시간 구현에 관한 연구)

  • Byun, Kyung-Jin;Kim, Jong-Jae;Han, Ki-Chun;Yoo, Hah-Young;Cha, Jin-Jong;Kim, Kyung-Su
    • The Journal of the Acoustical Society of Korea
    • /
    • v.16 no.2
    • /
    • pp.80-88
    • /
    • 1997
  • In this paper, the implementation of audio/data processor using ETRI DSP to support analog mode in digital/analog dual mode mobile phone is presented. Audio/data processor performs the wideband data processing, audio signal processing, demodulation function, and data rate conversion when it is operated in analog mode. These functions are programmed in assembly language, and then loaded to ETRI DSP together with vocoder program for the digital mode operation. This is a very efficient implementation of the dual mode cellular phone ASIC since the vocoder for the digital mode and audio/data processor for the analog mode are programmed together in the same hardware.

  • PDF

A Study on Improving Pitch Search for Vocoder (보코더에서 피치검색 성능개선에 관한 연구)

  • Baek, Geum-Ran;Bae, Myung-Jin
    • The Journal of the Acoustical Society of Korea
    • /
    • v.31 no.7
    • /
    • pp.419-426
    • /
    • 2012
  • The pitch searching is a vital process in a vocoder. Generally, the method of pitch searching is employed after highlighting the periodicity, where a correlation is identified with the signal by changing the interval of two pulses. When the correlation value reaches the peak, the pitch can be found by the pulse interval because it is the repetition interval with most striking period. However if the identified period happens to be one of half period, double period or triple period, this cannot be considered as the pitch period. Many methods were suggested to solve this problem. An inaccurate pitch could be obtained as well, when there is an interval where signal amplitude is not constant but varies abruptly in the frame. To solve this matter, searching the pitch by dividing a frame into various subframes is adopted, but too much calculation has to be followed while it leads the correct value. This paper suggests an algorithm to resolve these two problems. First, to search the pitch after advance correction of the signal energy level with an estimated overall energy change ratio in the frame before pitch search to reduce half period, double period and triple period is suggested. Second, to vary the number of subframes by predicting the amplitude change rate in the frame by the energy ratio obtained by the above-mentioned method is advised. If these two methods are applied, the pitch searching time can be reduced and the general pitch searching performance can be improved without affecting the sound quality in the synthesized signal.

The Performance Improvement of PLC by Using RTP Extension Header Data for Consecutive Frame Loss Condition in CELP Type Vocoder (CELP Type Vocoder에서 RTP 확장 헤더 데이터를 이용한 연속적인 프레임 손실에 대한 PLC 성능개선)

  • Hong, Seong-Hoon;Bae, Myung-Jin
    • The Journal of the Acoustical Society of Korea
    • /
    • v.29 no.1
    • /
    • pp.48-55
    • /
    • 2010
  • It has a falling off in speech quality, especially when consecutive packet loss occurs, even if a vocoder implemented in the packet network has its own packet loss concealment (PLC) algorithm. PLC algorithm is divided into transmitter and receiver algorithm. Algorithm in the transmitter gives superior quality by additional information. however it is impossible to provide mutual compatibility and it occurs extra delay and transmission rate. The method applied in the receiver does not require additional delay. However, it sets limits to improve the speech quality. In this paper, we propose a new method that puts extra information for PLC in a part of Extension Header Data which is not used in RTP Header. It can solve the problem and obtain enhanced speech quality. There is no extra delay occurred by the proposed algorithm because there is a jitter buffer to adjust network delay in a receiver. Extra information, 16 bits each frame for G.729 PLC, is allocated for MA filter index in LP synthesis, excitation signal, excitation signal gain and residual gain reconstruction. It is because a transmitter sends speech data each 20 ms when it transfers RTP payload. As a result, the proposed method shows superior performance about 13.5%.