
Speech Synthesis Based on CVC Speech Segments Extracted from Continuous Speech (연속 음성으로부터 추출한 CVC 음성세그먼트 기반의 음성합성)

  • 김재홍;조관선;이철희
    • The Journal of the Acoustical Society of Korea / v.18 no.7 / pp.10-16 / 1999
  • In this paper, we propose a concatenation-based speech synthesizer using CVC (consonant-vowel-consonant) speech segments extracted from an undesigned continuous speech corpus. Natural synthetic speech can be generated by properly modelling coarticulation effects between phonemes and by using natural prosodic variations. In general, a CVC synthesis unit shows smaller acoustic degradation of speech quality, since concatenation points are located in the consonant region, and it can properly model the coarticulation of vowels that are affected by surrounding consonants. We analyze the characteristics and the number of required synthesis units of four types of speech synthesis methods that use CVC synthesis units. Furthermore, we compare the speech quality of the four types and propose a new synthesis method based on the most promising type in terms of speech quality and implementability. We then implement the method using the speech corpus and synthesize various examples. The CVC speech segments that are not in the speech corpus are substituted by demonstrate speech segments. Experiments demonstrate that CVC speech segments extracted from a continuous speech corpus of about 100 Mbytes can produce high-quality synthetic speech.


Implementation of Korean TTS System based on Natural Language Processing (자연어 처리 기반 한국어 TTS 시스템 구현)

  • Kim Byeongchang;Lee Gary Geunbae
    • MALSORI / no.46 / pp.51-64 / 2003
  • To produce high-quality synthesized speech, it is very important to obtain an accurate grapheme-to-phoneme conversion and prosody model from texts using natural language processing. Robust preprocessing for non-Korean characters is also required. In this paper, we analyze Korean texts using a morphological analyzer, a part-of-speech tagger and a syntactic chunker. We present a new grapheme-to-phoneme conversion method for Korean, a hybrid method that combines a phonetic pattern dictionary with CCV (consonant vowel) LTS (letter to sound) rules, for unlimited-vocabulary Korean TTS. We construct a prosody model using a probabilistic method and a decision tree-based method. The probabilistic method alone usually suffers from performance degradation due to inherent data sparseness problems, so we adopt tree-based error correction to overcome these training data limitations.


Design and Implementation of a Text-to-Speech System using the Prosody and Duration Information (운율 및 길이 정보를 이용한 무제한 음성 합성기의 설계 및 구현)

  • Yang, Jin-Seok;Kim, Jae-Beom;Lee, Jeong-Hyeon
    • The Transactions of the Korea Information Processing Society / v.3 no.5 / pp.1121-1129 / 1996
  • To produce more natural speech in a Text-to-Speech system, prosody and duration must be processed in advance. The prosody and duration information was therefore extracted by means of trial-and-error experiments. In this paper, a method is proposed to improve naturalness in a Text-to-Speech system using this information. As a result, the Text-to-Speech system proposed and implemented in this paper produced more natural synthetic speech than systems that do not use this information.


MPEG-4 Audio Technology Trends (MPEG-4 오디오 기술 동향)

  • 한민수;강경옥;변경진
    • Broadcasting and Media Magazine / v.4 no.1 / pp.62-79 / 1999
  • In this survey paper, the emerging MPEG-4 audio technology is described. In the previous MPEG-1 and MPEG-2 audio works, only natural audio and speech coding techniques were the objects of standardization. In the MPEG-4 audio standardization, however, not only natural audio and speech coding but also structured audio and synthetic speech techniques are included. The purpose of this expansion can be summarized as preparation for the versatile, high-quality multimedia services expected to emerge in the 21st century.


Chinese Prosody Generation Based on C-ToBI Representation for Text-to-Speech (음성합성을 위한 C-ToBI기반의 중국어 운율 경계와 F0 contour 생성)

  • Kim, Seung-Won;Zheng, Yu;Lee, Gary-Geunbae;Kim, Byeong-Chang
    • MALSORI / no.53 / pp.75-92 / 2005
  • Prosody modeling is critical in developing text-to-speech (TTS) systems where speech synthesis is used to automatically generate natural speech. In this paper, we present a prosody generation architecture based on the Chinese Tone and Break Index (C-ToBI) representation. ToBI is a multi-tier representation system based on linguistic knowledge to transcribe events in an utterance. A TTS system that adopts ToBI as an intermediate representation is known to exhibit higher flexibility, modularity and domain/task portability than TTS systems with direct prosody generation. However, corpus preparation is very expensive for practical-level performance, because a ToBI-labeled corpus has to be constructed manually by many prosody experts, and a large amount of data is normally required for accurate statistical prosody modeling. This paper proposes a new method which transcribes the C-ToBI labels automatically in Chinese speech. We model Chinese prosody generation as a classification problem and apply conditional Maximum Entropy (ME) classification to it. We empirically verify the usefulness of various natural language and phonology features in building well-integrated features for the ME framework.


Prosodic Contour Generation for Korean Text-To-Speech System Using Artificial Neural Networks

  • Lim, Un-Cheon
    • The Journal of the Acoustical Society of Korea / v.28 no.2E / pp.43-50 / 2009
  • To obtain more natural synthetic speech from a Korean TTS (Text-To-Speech) system, we have to know all the possible prosodic rules of spoken Korean. These rules should be derived from linguistic and phonetic information or from real speech. In general, all of these rules should be integrated into the prosody-generation algorithm of a TTS system. But such an algorithm cannot cover all the possible prosodic rules of a language and is not perfect, so the naturalness of synthesized speech cannot be as good as we expect. ANNs (Artificial Neural Networks) can be trained to learn the prosodic rules of spoken Korean. To train and test ANNs, we need to prepare the prosodic patterns of all the phonemic segments in a prosodic corpus. A prosodic corpus includes meaningful sentences that represent all the possible prosodic rules. Sentences in the corpus were made by picking a series of words from a list of PB (Phonetically Balanced) isolated words. These sentences were read by speakers, recorded, and collected as a speech database. By analyzing the recorded real speech, we can extract the prosodic pattern of each phoneme and assign these patterns as target and test patterns for the ANNs. ANNs can learn prosody from natural speech and, when the phoneme strings of a sentence are given as input stimuli, generate the prosodic pattern of the central phonemic segment of each phoneme string as the output response.
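
The input arrangement described in the abstract above, where a phoneme string is presented to the network and the prosodic pattern of its central segment is the target, can be illustrated with a small sketch. The abstract does not give the window width, the padding symbol, or the phoneme inventory, so `width`, `pad`, and the example phonemes below are assumptions chosen only for illustration; the network itself is not modelled.

```python
from typing import List, Sequence, Tuple


def context_windows(
    phonemes: Sequence[str], width: int = 5, pad: str = "#"
) -> List[Tuple[Tuple[str, ...], str]]:
    """Pair each phoneme with a fixed-width window centred on it.

    `width` and `pad` are hypothetical parameters, not values from the paper.
    """
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd number")
    half = width // 2
    # Pad both ends so the first and last phonemes can also sit at the centre.
    padded = [pad] * half + list(phonemes) + [pad] * half
    return [
        (tuple(padded[i:i + width]), padded[i + half])
        for i in range(len(phonemes))
    ]


if __name__ == "__main__":
    # Made-up phoneme sequence, used only to show the window layout.
    for window, centre in context_windows(["k", "a", "m", "s", "a"], width=3):
        print(window, "-> centre:", centre)
```

In a full system each window would be encoded numerically and paired with the measured prosodic pattern of its centre phoneme as the training target.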

Comparison of Speech Rate and Long-Term Average Speech Spectrum between Korean Clear Speech and Conversational Speech

  • Yoo, Jeeun;Oh, Hongyeop;Jeong, Seungyeop;Jin, In-Ki
    • Journal of Audiology & Otology / v.23 no.4 / pp.187-192 / 2019
  • Background and Objectives: Clear speech is an effective communication strategy used in difficult listening situations that draws on techniques such as accurate articulation, a slow speech rate, and the inclusion of pauses. Although excessively slow speech and improperly amplified spectral information can reduce overall speech intelligibility, certain amplitude increments of the mid-frequency bands (1 to 3 dB) and speech rates around 50% slower in clear speech than in conversational speech have been reported as factors that improve speech intelligibility. The purpose of this study was to identify whether amplitude increments in the mid-frequency areas and slower speech rates were evident in Korean clear speech as they were in English clear speech. Subjects and Methods: To compare the acoustic characteristics of the two methods of speech production, the voices of 60 participants were recorded during conversational speech and then again during clear speech using standardized sentence material. Results: The speech rate and long-term average speech spectrum (LTASS) were analyzed and compared. Speech rates for clear speech were slower than those for conversational speech. Increased amplitudes in the mid-frequency bands were evident in the LTASS of clear speech. Conclusions: The observed differences in the acoustic characteristics between the two types of speech production suggest that Korean clear speech can be an effective communication strategy to improve speech intelligibility.

Natural-Language-Based Robot Action Control Using a Hierarchical Behavior Model

  • Ahn, Hyunsik;Ko, Hyun-Bum
    • IEIE Transactions on Smart Processing and Computing / v.1 no.3 / pp.192-200 / 2012
  • In order for humans and robots to interact in daily life, robots need to understand human speech and link it to their actions. This paper proposes a hierarchical behavior model for robot action control using natural language commands. The model, which consists of episodes, primitive actions and atomic functions, uses a sentential cognitive system that includes multiple modules for perception, action, reasoning and memory. Human speech commands are translated into sentences by a natural language processor and then syntactically parsed. A semantic parsing procedure was applied to human speech by analyzing the verbs and phrases of the sentences and linking them to the cognitive information. The cognitive system performed according to the hierarchical behavior model, whose episodes, primitive actions and atomic functions are implemented in the system. In the experiments, a possible episode, "Water the pot," was tested and its feasibility was evaluated.


A Study on DNN-based STT Error Correction

  • Jong-Eon Lee
    • International journal of advanced smart convergence / v.12 no.4 / pp.171-176 / 2023
  • This study presents a speech recognition error correction system designed to detect and correct speech recognition errors before natural language processing, in order to increase the success rate of intent analysis with optimal efficiency in various service domains. An encoder is constructed to embed each correct speech token and the one or more error speech tokens corresponding to it so that they are located close together in a dense vector space, with similar vector values. For each embedded correct utterance token, an error detector detects one or more utterance tokens within a preset Manhattan distance of that token in the dense vector space. Errors are corrected by extracting, as the correct answer, the correct utterance token closest to the detected error utterance token in Manhattan distance.
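
The detection and correction step described in the abstract above can be sketched as a nearest-neighbour lookup under Manhattan (L1) distance. This is only an illustration under stated assumptions: the abstract does not specify the encoder, the embedding dimension, or the threshold, so the two-dimensional vectors, the token names, and the `threshold` value below are all made up, and `correct_token` is a hypothetical helper rather than the paper's implementation.

```python
from typing import Dict, Optional, Sequence


def manhattan(a: Sequence[float], b: Sequence[float]) -> float:
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def correct_token(
    vec: Sequence[float],
    correct_embeddings: Dict[str, Sequence[float]],
    threshold: float,
) -> Optional[str]:
    """Return the nearest correct token if it lies within `threshold`.

    Returns None when no correct token is close enough, i.e. the input
    is not treated as a known error variant of any correct token.
    """
    best_token, best_dist = None, float("inf")
    for token, emb in correct_embeddings.items():
        dist = manhattan(vec, emb)
        if dist < best_dist:
            best_token, best_dist = token, dist
    return best_token if best_dist <= threshold else None


if __name__ == "__main__":
    # Made-up embeddings; a real system would obtain these from the encoder.
    correct = {"weather": (1.0, 0.0), "music": (0.0, 1.0)}
    print(correct_token((0.9, 0.2), correct, threshold=0.5))
    print(correct_token((5.0, 5.0), correct, threshold=0.5))
```

The linear scan is fine for a small vocabulary; a larger one would likely need an indexed nearest-neighbour search, which the abstract does not discuss.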

MPEG-4 TTS (Text-to-Speech)

  • 한민수
    • Proceedings of the IEEK Conference / 1999.06a / pp.699-707 / 1999
  • It cannot be denied that speech is the most natural interfacing tool between men and machines. In order to realize acceptable speech interfaces, highly advanced speech recognizers and synthesizers are indispensable. Text-to-Speech (TTS) technology has been attracting a lot of interest among speech engineers because of its benefits, namely its possible application areas: talking computers, emergency alarm systems using speech, speech output devices for the speech-impaired, and so on. Hence, many researchers have made significant progress in speech synthesis techniques for their own languages and, as a result, the quality of currently available speech synthesizers is believed to be acceptable to normal users. These are partly why the MPEG group decided to include TTS technology as one of its MPEG-4 functionalities. Among the various MPEG-4 functionalities, ETRI has made major contributions to the current MPEG-4 TTS. They are: 1) use of original prosody for synthesized speech output, 2) trick mode functions for general users without breaking synthesized speech prosody, 3) interoperability with Facial Animation (FA) tools, and 4) dubbing a moving/animated picture with lip-shape pattern information.
