• Title/Summary/Keyword: Text-to-speech

Speaker Identification using Phonetic GMM (음소별 GMM을 이용한 화자식별)

  • Kwon Sukbong;Kim Hoi-Rin
    • Proceedings of the KSPS conference / 2003.10a / pp.185-188 / 2003
  • In this paper, we construct a phonetic GMM for a text-independent speaker identification system. The basic idea is to combine the advantages of the baseline GMM and the HMM. The GMM is better suited to text-independent speaker identification, whereas the HMM works better in text-dependent systems. The phonetic GMM represents a more sophisticated text-dependent speaker model built on a text-independent speaker model. In a speaker identification system, the phonetic GMM, which uses HMM-based speaker-independent phoneme recognition, performs better than the baseline GMM. In addition, an N-best recognition algorithm is used to decrease the computational complexity and to make the method applicable to new speakers.

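
The phonetic GMM idea in the abstract above can be illustrated with a minimal scoring sketch. It assumes that each speaker has one diagonal-covariance GMM per phoneme and that a phoneme recognizer has already labeled every frame; the model parameters, phoneme labels, and the function names `log_gmm` and `identify_speaker` are hypothetical and are not taken from the paper, whose exact scoring rule and N-best handling the abstract does not specify.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def log_gmm(x, gmm):
    """Log-likelihood of x under a GMM given as [(weight, mean, var), ...]."""
    terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    top = max(terms)
    # log-sum-exp keeps the mixture sum numerically stable
    return top + math.log(sum(math.exp(t - top) for t in terms))

def identify_speaker(frames, models):
    """Pick the speaker whose phoneme-specific GMMs best explain the frames.

    frames: [(phoneme_label, feature_vector), ...] from a phoneme recognizer
    models: {speaker: {phoneme_label: gmm}}
    """
    scores = {
        speaker: sum(log_gmm(x, phone_gmms[ph]) for ph, x in frames)
        for speaker, phone_gmms in models.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical 1-D toy models for two speakers and two phonemes.
models = {
    "spk1": {"a": [(0.6, [0.0], [1.0]), (0.4, [0.5], [1.0])],
             "i": [(1.0, [3.0], [1.0])]},
    "spk2": {"a": [(0.5, [2.0], [1.0]), (0.5, [2.5], [1.0])],
             "i": [(1.0, [5.0], [1.0])]},
}
frames = [("a", [0.1]), ("i", [2.9]), ("a", [-0.2])]
best, scores = identify_speaker(frames, models)
print(best)
```

A baseline GMM would score every frame against a single mixture per speaker; the only change here is that the phoneme label selects which mixture is used.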

Normalization in Collection Procedures of Emotional Speech by Scriptual Context (대본 내용에 의한 정서음성 수집과정의 정규화에 대하여)

  • Jo Cheol-Woo
    • Proceedings of the KSPS conference / 2006.05a / pp.123-125 / 2006
  • One of the biggest unsolved problems in emotional speech acquisition is how to create or find a situation that is close to a natural or desired human state. We propose a method of collecting emotional speech data by scriptual context. Several contexts from drama scripts were chosen by experts in the area. The contexts were divided into 6 classes according to their content. Two actors, one male and one female, read the text after recognizing the emotional situations in the script.

Sums-of-Products Models for Korean Segment Duration Prediction

  • Chung, Hyun-Song
    • Speech Sciences / v.10 no.4 / pp.7-21 / 2003
  • Sums-of-Products models were built for segment duration prediction of spoken Korean. An experiment for the modelling was carried out to apply the results to Korean text-to-speech synthesis systems. 670 read sentences were analyzed, trained and tested for the construction of the duration models. Traditional sequential rule systems were extended to simple additive, multiplicative and additive-multiplicative models based on Sums-of-Products modelling. The parameters used in the modelling include the properties of the target segment and its neighbors and the target segment's position in the prosodic structure. Two optimisation strategies were used: the downhill simplex method and the simulated annealing method. The performance of the models was measured by the correlation coefficient and the root mean squared prediction error (RMSE) between actual and predicted duration in the test data. The best performance was obtained when the data was trained and tested with additive-multiplicative models. The correlation for vowel duration prediction was 0.69 and the RMSE was 31.80 ms, while the correlation for consonant duration prediction was 0.54 and the RMSE was 29.02 ms. The results were not good enough to be applied to real-time text-to-speech systems. Further investigation of feature interactions is required to improve the performance of the Sums-of-Products models.

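
As a rough illustration of the modelling described in the abstract above, a sums-of-products predictor adds up several terms, each of which is a product of per-factor parameters, and is evaluated with the correlation coefficient and RMSE. The sketch below assumes a hypothetical additive-multiplicative model with made-up factor names and parameter values; the paper's actual factors and fitted parameters are not given in the abstract, and the parameter fitting (downhill simplex, simulated annealing) is not shown.

```python
import math

def predict_duration(features, terms):
    """Sums-of-products: sum over terms of the product of per-factor parameters.

    features: {factor_name: level}
    terms: list of {factor_name: {level: parameter}}
    """
    total = 0.0
    for term in terms:
        product = 1.0
        for factor, table in term.items():
            product *= table[features[factor]]
        total += product
    return total

def rmse(actual, predicted):
    """Root mean squared prediction error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def pearson(actual, predicted):
    """Pearson correlation coefficient between actual and predicted values."""
    n = len(actual)
    mean_a, mean_p = sum(actual) / n, sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)

# Hypothetical model: (segment x position) + next-segment offset, in ms.
terms = [
    {"segment": {"a": 80.0, "i": 60.0},
     "position": {"final": 1.5, "medial": 1.0}},
    {"next": {"stop": -5.0, "nasal": 10.0}},
]
sample = {"segment": "a", "position": "final", "next": "nasal"}
predicted = predict_duration(sample, terms)  # 80.0 * 1.5 + 10.0
print(predicted)
```

With a single term the model is purely multiplicative, and with one factor per term it is purely additive, which is how the three model families named in the abstract fit the same form.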

Algorithm for Concatenating Multiple Phonemic Units for Small Size Korean TTS Using RE-PSOLA Method

  • Bak, Il-Suh;Jo, Cheol-Woo
    • Speech Sciences / v.10 no.1 / pp.85-94 / 2003
  • In this paper, an algorithm to reduce the size of a text-to-speech database is proposed. The algorithm is based on the characteristics of Korean phonemic units. From the initial database, a reduced phoneme unit set is derived from the articulatory similarity of concatenating phonemes. The speech data were read by one female announcer for 1000 phonetically balanced sentences. All the recorded speech was then segmented by phoneticians. The total size of the original speech data is about 640 MB including the laryngograph signal. To synthesize the waveform, RE-PSOLA (Residual-Excited Pitch Synchronous Overlap and Add) was used. The voice quality of the synthesized speech was compared with the original speech in terms of spectrographic information and objective tests. The quality of the synthesized speech is not much degraded when the size of the synthesis DB is reduced from 320 MB to 82 MB.

Common Speech Database Collection for Telecommunications (통신망환경 한국어 공통음성 DB 구축)

  • Kim Sanghun;Park Moonwhan;Kim Hyunsuk
    • Proceedings of the KSPS conference / 2003.05a / pp.23-26 / 2003
  • This paper presents a common speech database collection for telecommunication applications. During a 3-year project, we will construct very large-scale speech and text databases for speech recognition, speech synthesis, and speaker identification. The design of the common speech database takes into account various communication environments and the distribution of speakers' sex, age, and region. It consists of Korean continuous digits, isolated words, and sentences that reflect Korean phonetic coverage. In addition, it covers various pronunciation styles such as read speech, dialogue speech, and semi-spontaneous speech. Thanks to the common speech databases, duplication of resources within the Korean speech industry can be avoided. This should encourage the domestic speech industry and stimulate the domestic speech technology market.

Speech Recognition based Message Transmission System for the Hearing Impaired Persons (청각장애인을 위한 음성인식 기반 메시지 전송 시스템)

  • Kim, Sung-jin;Cho, Kyoung-woo;Oh, Chang-heon
    • Journal of the Korea Institute of Information and Communication Engineering / v.22 no.12 / pp.1604-1610 / 2018
  • Speech recognition services are used as an ancillary means of communication for hearing-impaired persons by converting the speaker's voice into text and displaying it. However, in open environments such as classrooms and conference rooms, it is difficult to provide a speech recognition service to many hearing-impaired persons at once. A method is therefore needed to provide the service efficiently according to the surrounding environment. In this paper, we propose a system that recognizes the speaker's voice and transmits the converted text to many hearing-impaired persons as messages. The proposed system uses the MQTT protocol to deliver messages to many users at the same time. The end-to-end delay was measured to confirm the service delay of the proposed system according to the QoS level setting of the MQTT protocol. The measured difference in delay between the most reliable QoS level 2 and QoS level 0 is 111 ms, confirming that the QoS setting does not greatly affect conversation recognition.
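
The QoS comparison in the abstract above can be sketched as follows. MQTT QoS 0 sends a single PUBLISH packet, QoS 1 adds a PUBACK, and QoS 2 uses the four-packet PUBLISH, PUBREC, PUBREL, PUBCOMP exchange, which is why higher levels are expected to add delay. The helper `mean_delay_by_qos` and the timestamp records are hypothetical placeholders for illustration only; they are not the paper's measurement setup or data.

```python
from collections import defaultdict
from statistics import mean

# Control packets exchanged per hop for each MQTT QoS level.
QOS_PACKETS = {
    0: ("PUBLISH",),
    1: ("PUBLISH", "PUBACK"),
    2: ("PUBLISH", "PUBREC", "PUBREL", "PUBCOMP"),
}

def mean_delay_by_qos(records):
    """Average end-to-end delay per QoS level.

    records: iterable of (qos, sent_ms, received_ms) tuples, where sent_ms is
    stamped by the publisher and received_ms by the subscriber.
    """
    delays = defaultdict(list)
    for qos, sent_ms, received_ms in records:
        delays[qos].append(received_ms - sent_ms)
    return {qos: mean(values) for qos, values in delays.items()}

# Made-up timestamps in milliseconds, purely to exercise the function.
records = [(0, 0, 40), (0, 100, 150), (2, 200, 330), (2, 400, 540)]
averages = mean_delay_by_qos(records)
extra_delay = averages[2] - averages[0]
print(averages, extra_delay)
```

Comparing publisher and subscriber timestamps in this way assumes the two clocks are synchronized; otherwise a round-trip measurement on one machine would be needed.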

MPEG-4 TTS: Current Status and Prospects (MPEG-4TTS 현황 및 전망)

  • 한민수
    • The Magazine of the IEIE / v.24 no.9 / pp.91-98 / 1997
  • Text-to-Speech (TTS) technology has been attracting a lot of interest among speech engineers because of its benefits, namely its possible application to talking computers, spoken emergency alarm systems, speech output devices for the speech-impaired, and so on. Hence, many researchers have made significant progress in speech synthesis techniques for their own languages and, as a result, the quality of current speech synthesizers is believed to be acceptable to ordinary users. These are partly why the MPEG group decided to include TTS technology as one of its MPEG-4 functionalities. ETRI has made major contributions to the current MPEG-4 TTS appearing in various MPEG-4 documents, with relatively minor contributions from AT&T and NTT. The main MPEG-4 functionalities presently available are: 1) use of original prosody for synthesized speech output, 2) trick mode functions for general users without breaking synthesized speech prosody, 3) interoperability with Facial Animation (FA) tools, and 4) dubbing a moving/animated picture with lip-shape pattern information.

An Improved Coverless Text Steganography Algorithm Based on Pretreatment and POS

  • Liu, Yuling;Wu, Jiao;Chen, Xianyi
    • KSII Transactions on Internet and Information Systems (TIIS) / v.15 no.4 / pp.1553-1567 / 2021
  • Steganography is currently a hot research topic in information security and privacy protection. However, most previous steganography methods are not effective against steganalysis and attacks because they usually work by modifying covers. In this paper, we propose an improved coverless text steganography algorithm based on pretreatment and Part of Speech (POS), in which Chinese character components are used as the locating marks, the POS is used to hide the number of keywords, and the retrieval of stego-texts is optimized by pretreatment. Experiments verify that our algorithm performs well in terms of embedding capacity, embedding success rate, and extraction accuracy, given appropriate lengths of locating marks and a large-scale text database.

A Comparative Study of Intonation Phrase Boundary Tones of Korean Produced by Korean Speakers and Chinese Speakers in the Reading of Korean Text (중국인 학습자들의 한국어 억양구 경계톤 실현 양상)

  • Yune, Young-Sook
    • Phonetics and Speech Sciences / v.2 no.4 / pp.39-49 / 2010
  • The purpose of this paper is to examine how Chinese speakers realize Korean intonation phrase (IP) boundary tones in the reading of a Korean text. Korean IP boundary tones play various roles in speech communication. They indicate prosodic constituents' boundaries while simultaneously performing pragmatic and grammatical functions. In order to express and understand Korean utterances correctly, it is necessary to understand the Korean IP boundary tone system. To investigate the IP boundary tone produced by Chinese speakers, we have specifically examined the type of boundary tones, the degree of internal pitch modulation of boundary tones, and the pitch difference between penultimate syllables and boundary tones. The results of each analysis were compared to the IP boundary tones produced by Korean native speakers. The results show that IP boundary tones were realized higher than penultimate syllables.

A Study on Noise-Robust Methods for Broadcast News Speech Recognition (방송뉴스 인식에서의 잡음 처리 기법에 대한 고찰)

  • Chung Yong-joo
    • MALSORI / no.50 / pp.71-83 / 2004
  • Recently, broadcast news speech recognition has become one of the most attractive research areas. If we can automatically transcribe broadcast news and store its contents as text instead of the video or audio signal itself, it will be much easier to search multimedia databases for what we need. However, the desired speech signal in broadcast news is usually affected by interfering signals such as background noise and/or music. Also, the speech of a reporter who is speaking over the telephone or with a poorly conditioned microphone is severely distorted by the channel effect. Such interfered or distorted speech may be the main reason for the poor performance of broadcast news speech recognition. In this paper, we investigate some methods to cope with these problems and observe some performance improvements in noisy broadcast news speech recognition.
