• Title/Summary/Keyword: Korean text-to-speech system


Control of Duration Model Parameters in HMM-based Korean Speech Synthesis (HMM 기반의 한국어 음성합성에서 지속시간 모델 파라미터 제어)

  • Kim, Il-Hwan;Bae, Keun-Sung
    • Speech Sciences / v.15 no.4 / pp.97-105 / 2008
  • HMM-based text-to-speech systems (HTS) have been widely studied because they need less memory and computation than corpus-based unit-concatenation text-to-speech systems, which makes them suitable for embedded systems. They also have the advantage that the voice characteristics and speaking rate of the synthetic speech can be changed easily by modifying the HMM parameters appropriately. We implemented an HMM-based Korean text-to-speech system using a small Korean speech DB and propose a method to increase the naturalness of the synthetic speech by controlling the duration model parameters of the system. We performed a paired comparison test to verify that these techniques are effective. The test yielded a preference score of 73.8%, showing that controlling the duration model parameters improves the naturalness of the synthetic speech.


An end-to-end synthesis method for Korean text-to-speech systems (한국어 text-to-speech(TTS) 시스템을 위한 엔드투엔드 합성 방식 연구)

  • Choi, Yeunju;Jung, Youngmoon;Kim, Younggwan;Suh, Youngjoo;Kim, Hoirin
    • Phonetics and Speech Sciences / v.10 no.1 / pp.39-48 / 2018
  • A typical statistical parametric speech synthesis (text-to-speech, TTS) system consists of separate modules, such as a text analysis module, an acoustic modeling module, and a speech synthesis module. This causes two problems: 1) expert knowledge of each module is required, and 2) errors generated in each module accumulate as they pass through the subsequent modules. An end-to-end TTS system could avoid such problems by synthesizing voice signals directly from an input string. In this study, we implemented an end-to-end Korean TTS system using Google's Tacotron, which is an end-to-end TTS system based on a sequence-to-sequence model with an attention mechanism. We used 4392 utterances spoken by a Korean female speaker, an amount that corresponds to 37% of the dataset Google used for training Tacotron. Our system obtained a mean opinion score (MOS) of 2.98 and a degradation mean opinion score (DMOS) of 3.25. We discuss the factors that affected training of the system. Experiments demonstrate that the post-processing network needs to be designed with the output language and input characters in mind, and that, depending on the amount of training data, the maximum value of n for the n-grams modeled by the encoder should be small enough.

Building an Exceptional Pronunciation Dictionary For Korean Automatic Pronunciation Generator (한국어 자동 발음열 생성을 위한 예외발음사전 구축)

  • Kim, Sun-Hee
    • Speech Sciences / v.10 no.4 / pp.167-177 / 2003
  • This paper presents a method of building an exceptional pronunciation dictionary for a Korean automatic pronunciation generator. An automatic pronunciation generator is an essential element of both speech recognition systems and TTS (text-to-speech) systems. It is composed of a set of regular rules and an exceptional pronunciation dictionary. The exceptional pronunciation dictionary is created by extracting words with exceptional pronunciations from a text corpus, based on the characteristics of such words as identified through phonological research and text analysis. The method thus helps improve the performance of the Korean automatic pronunciation generator as well as that of speech recognition and TTS systems.


Implementation of Music Broadcasting Service System in the Shopping Center Using Text-To-Speech Technology (TTS를 이용한 매장 음악 방송 서비스 시스템 구현)

  • Chang, Moon-Soo;Kang, Sun-Mee
    • Speech Sciences / v.14 no.4 / pp.169-178 / 2007
  • This thesis describes the development of a service system for small shops that supports not only music broadcasting but also editing and generating voice announcements using TTS (text-to-speech) technology. The system is web-based, so it can be accessed easily whenever and wherever it is needed. It controls sound using a Silverlight media player based on ASP.NET 2.0 technology, without any additional application software. The use of Ajax controls lets the system handle multiple users under maximum load when needed. TTS runs on the server side, so the service can be provided without relying on the user's computer. Thanks to the convenience and usefulness of the system, the business sector can provide better service to many shops. Additional functions such as statistical analysis should further help shop management provide the desired services.


AP, IP Prediction For Corpus-based Korean Text-To-Speech (코퍼스 방식 음성합성에서의 개선된 운율구 경계 예측)

  • Kwon, O-Hil;Hong, Mun-Ki;Kang, Sun-Mee;Shin, Ji-Young
    • Speech Sciences / v.9 no.3 / pp.25-34 / 2002
  • One of the most important factors in the performance of a Korean text-to-speech system is the prediction of accentual phrase (AP) and intonational phrase (IP) boundaries. Previous prediction methods reach only 75-85%, which is inadequate for practical and commercial systems. More accurate prediction is therefore needed in practical systems. In this study, we propose a simple and more accurate method for predicting AP and IP boundaries.


Chinese Prosody Generation Based on C-ToBI Representation for Text-to-Speech (음성합성을 위한 C-ToBI기반의 중국어 운율 경계와 F0 contour 생성)

  • Kim, Seung-Won;Zheng, Yu;Lee, Gary-Geunbae;Kim, Byeong-Chang
    • MALSORI / no.53 / pp.75-92 / 2005
  • Prosody modeling is critical in developing text-to-speech (TTS) systems where speech synthesis is used to automatically generate natural speech. In this paper, we present a prosody generation architecture based on the Chinese Tone and Break Index (C-ToBI) representation. ToBI is a multi-tier representation system based on linguistic knowledge to transcribe events in an utterance. A TTS system that adopts ToBI as an intermediate representation is known to exhibit higher flexibility, modularity, and domain/task portability than TTS systems with direct prosody generation. However, corpus preparation is very expensive for practical-level performance, because a ToBI-labeled corpus has to be constructed manually by many prosody experts, and accurate statistical prosody modeling normally requires a large amount of data. This paper proposes a new method that transcribes C-ToBI labels automatically in Chinese speech. We model Chinese prosody generation as a classification problem and apply conditional Maximum Entropy (ME) classification to it. We empirically verify the usefulness of various natural language and phonology features in building well-integrated features for the ME framework.


PROSODY CONTROL BASED ON SYNTACTIC INFORMATION IN KOREAN TEXT-TO-SPEECH CONVERSION SYSTEM

  • Kim, Yeon-Jun;Oh, Yung-Hwan
    • Proceedings of the Acoustical Society of Korea Conference / 1994.06a / pp.937-942 / 1994
  • A text-to-speech (TTS) conversion system can convert any words or sentences into speech. To synthesize speech the way human beings do, careful prosody control, including intonation, duration, accent, and pause, is required. It helps listeners understand the speech clearly and makes the speech sound more natural. In this paper, a prosody control scheme that makes use of function-word information is proposed. Among the many factors of prosody, intonation, duration, and pause are closely related to syntactic structure, and their relations have been formalized and embodied in the TTS system. To evaluate the speech synthesized with the proposed prosody control, the mean opinion score (MOS) method, a subjective evaluation method, was used. The synthesized speech was tested on 10 listeners, and each listener scored the speech between 1 and 5. The evaluation experiments show that the proposed prosody control helps the TTS system synthesize more natural speech.


Performance improvement of text-dependent speaker verification system using blind speech segmentation and energy weight (Blind speech segmentation과 에너지 가중치를 이용한 문장 종속형 화자인식기의 성능 향상)

  • Kim, Jung-Gon;Kim, Hyung Soon
    • MALSORI / no.47 / pp.131-140 / 2003
  • We propose a new method of generating client models for an HMM-based text-dependent speaker verification system with only a small amount of training data. To make a client model, statistical methods such as the segmental K-means algorithm are widely used, but they do not guarantee the quality or reliability of a model when only limited data are available. In this paper, we propose blind speech segmentation based on the level-building DTW algorithm as an alternative method for making a client model with limited data. In addition, considering that voiced sounds carry much more speaker-specific information than unvoiced sounds and that the energy of the former is higher than that of the latter, we also propose a new score evaluation method that uses the observation probability raised to the power of a weighting factor estimated from the normalized log energy. Our experiment shows that the proposed methods are superior to a conventional HMM-based speaker verification system.


GENERATION OF MULTI-SYLLABLE NONSENSE WORDS FOR THE ASSESSMENT OF KOREAN TEXT-TO SPEECH SYSTEM (한국어 문장음성합성 시스템의 평가를 위한 다음절 무의미단어의 생성 및 평가에 관한 연구)

  • 조철우
    • Proceedings of the Acoustical Society of Korea Conference / 1994.06c / pp.338-341 / 1994
  • In this paper we propose a method to generate a multi-syllable nonsense wordset for the purpose of synthetic speech assessment and apply the wordset to assess one commercial text-to-speech system. Some experimental results are presented, and it is verified that the generated nonsense wordset can be used to assess the intelligibility of the synthesizer at the phoneme level or at the phonemic-environment level. The experimental results confirm that such a multi-syllable nonsense wordset can be useful for the assessment of synthesized speech.


Perceptual Evaluation of Duration Models in Spoken Korean

  • Chung, Hyun-Song
    • Speech Sciences / v.9 no.1 / pp.207-215 / 2002
  • Perceptual evaluation of duration models of spoken Korean was carried out based on the Classification and Regression Tree (CART) model for text-to-speech conversion. A reference set of durations was produced by a commercial text-to-speech synthesis system for comparison. The duration model which was built in the previous research (Chung & Huckvale, 2001) was applied to a Korean language speech synthesis diphone database, 'Hanmal (HN 1.0)'. The synthetic speech produced by the CART duration model was preferred in the subjective preference test by a small margin and the synthetic speech from the commercial system was superior in the clarity test. In the course of preparing the experiment, a labeled database of spoken Korean with 670 sentences was constructed. As a result of the experiment, a trained duration model for speech synthesis was obtained. The 'Hanmal' diphone database for Korean speech synthesis was also developed as a by-product of the perceptual evaluation.
