• Title/Summary/Keyword: Voice Synthesis

A Study on Compensation of Amplitude in Multi Pulse (멀티펄스의 진폭보정에 관한 연구)

  • Lee, See-Woo
    • Journal of the Korea Academia-Industrial cooperation Society / v.12 no.9 / pp.4119-4124 / 2011
  • In MPC coding that uses voiced and unvoiced excitation sources, the speech waveform can be distorted when the amplitude of the speech signal increases or decreases within a frame. This is caused by the normalization of the synthesized speech signal during restoration of the multi-pulses of the representative section. To solve this problem, this paper presents a method of amplitude compensation (AC-MPC) applied to the multi-pulses of each pitch interval in order to reduce the distortion of the speech waveform. I confirmed that the method can synthesize speech close to the original waveform, and I evaluated MPC against AC-MPC. As a result, the SNRseg of AC-MPC improved by 0.7 dB for female voices and 0.7 dB for male voices, respectively. Compared with MPC, the improved SNRseg of AC-MPC shows that the distortion of the speech waveform can be controlled. I therefore expect this method to be applicable to cellular phones and smartphones that use low-bit-rate excitation sources.
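
The paper's objective measure is segmental SNR (SNRseg) between the original and synthesized waveforms. For readers unfamiliar with the metric, here is a minimal Python sketch; the 160-sample frame length and the [-10, 35] dB per-frame clamp are common conventions assumed for illustration, not details taken from the paper.

```python
import numpy as np

def snr_seg(clean, synthesized, frame_len=160, eps=1e-10):
    """Segmental SNR (SNRseg) in dB, averaged over fixed-length frames."""
    n_frames = min(len(clean), len(synthesized)) // frame_len
    snrs = []
    for i in range(n_frames):
        s = clean[i * frame_len:(i + 1) * frame_len].astype(np.float64)
        e = s - synthesized[i * frame_len:(i + 1) * frame_len]
        snrs.append(10.0 * np.log10(np.sum(s ** 2) / (np.sum(e ** 2) + eps) + eps))
    # Clamp each frame to [-10, 35] dB, a common convention for SNRseg
    return float(np.mean(np.clip(snrs, -10.0, 35.0)))
```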

Text-to-speech with linear spectrogram prediction for quality and speed improvement (음질 및 속도 향상을 위한 선형 스펙트로그램 활용 Text-to-speech)

  • Yoon, Hyebin
    • Phonetics and Speech Sciences / v.13 no.3 / pp.71-78 / 2021
  • Most neural-network-based speech synthesis models utilize neural vocoders to convert mel-scaled spectrograms into high-quality, human-like voices. However, neural vocoders combined with mel-scaled spectrogram prediction models demand considerable computer memory and time during the training phase and are subject to slow inference speeds in an environment where GPU is not used. This problem does not arise in linear spectrogram prediction models, as they do not use neural vocoders, but these models suffer from low voice quality. As a solution, this paper proposes a Tacotron 2 and Transformer-based linear spectrogram prediction model that produces high-quality speech and does not use neural vocoders. Experiments suggest that this model can serve as the foundation of a high-quality text-to-speech model with fast inference speed.
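
Because the model predicts linear magnitude spectrograms directly, a waveform can be recovered with the classical Griffin-Lim algorithm rather than a neural vocoder. A minimal sketch using librosa follows; the STFT parameters and iteration count are illustrative assumptions, not the paper's configuration.

```python
import librosa
import soundfile as sf

# Illustrative analysis parameters; the paper's actual settings may differ.
N_FFT, HOP, SR = 1024, 256, 22050

def linear_spec_to_wav(mag, out_path="tts_out.wav"):
    """Invert a predicted linear magnitude spectrogram (freq bins x frames)
    to audio with Griffin-Lim, avoiding any neural vocoder."""
    wav = librosa.griffinlim(mag, n_iter=60, hop_length=HOP, win_length=N_FFT)
    sf.write(out_path, wav, SR)
    return wav
```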

Development of Half-Mirror Interface System and Its Application for Ubiquitous Environment (유비쿼터스 환경을 위한 하프미러형 인터페이스 시스템 개발과 응용)

  • Kwon Young-Joon;Kim Dae-Jin;Lee Sang-Wan;Bien Zeungnam
    • Journal of Institute of Control, Robotics and Systems / v.11 no.12 / pp.1020-1026 / 2005
  • In the era of ubiquitous computing, human-friendly man-machine interfaces are attracting attention for their potential to offer convenient services. To this end, this paper introduces the 'Half-Mirror Interface System (HMIS)' as a novel type of human-friendly man-machine interface. HMIS consists of a half-mirror, a USB webcam, a microphone, a two-channel speaker, and a high-speed processing unit. Its two principal operation modes are selected according to whether a user is present in front of it. The first, 'mirror-mode', is activated when the user's face is detected via the USB webcam. In this mode, HMIS provides three basic functions: 1) make-up assistance, which magnifies a facial component of interest and gives TTS (Text-To-Speech) guidance for appropriate make-up; 2) daily weather information retrieved via a WWW service; and 3) health monitoring/diagnosis based on Chinese medicine knowledge. The second, 'display-mode', shows decorative pictures, family photos, art paintings, and so on. It is activated when the user's face has not been detected for some time. In display-mode, we also added a 'healing-window' function and a 'healing-music player' function for the user's psychological comfort and relaxation. All these functions are accessible through a commercially available voice synthesis/recognition package.
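
The mode switch described above hinges on simple face detection. A minimal sketch of that control loop with OpenCV is shown below; the Haar cascade detector and the 5-second timeout are assumptions for illustration, not the paper's actual implementation.

```python
import time
import cv2

FACE_TIMEOUT_S = 5.0  # assumed dwell time before falling back to display-mode
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def hmis_loop(camera_index=0):
    cap = cv2.VideoCapture(camera_index)
    mode, last_seen = "display", 0.0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
        if len(faces) > 0:
            last_seen = time.time()
        # mirror-mode while a face is (recently) visible, display-mode otherwise
        mode = "mirror" if time.time() - last_seen < FACE_TIMEOUT_S else "display"
    cap.release()
    return mode
```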

A study on Web interface for the Blind. (시각장애인을 위한 웹 인터페이스에 관한 연구)

  • Choi, T.J.;Jang, B.T.;Kim, H.K.;Kim, J.K.;Hur, W.
    • Proceedings of the IEEK Conference / 1999.06a / pp.559-562 / 1999
  • In this paper, we developed on internet based assembly information display system for the blind. The system is consist of hardware and software. The hardware is consist of a voice synthesis device and a tactile display for character information, and the software is consist of internet web browser for the blind and braille program. The tactile-device system consists of a control unit, pin array, pin generator, serial port, and a power supply. The pin exerted by a electromagnetic method, solenoid. The internet web browser separates the character and image from internet web page, and character information in the web page is converted to braille and fed to sound system. Also the image in the web page can be printed developed tactile display. As the results of experiment, the blind could access the internet web site by using this system and understand various internet information.
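
The browser's key step is separating character information (routed to braille and speech) from images (routed to the tactile pin display). The 1999 system predates today's libraries, so the sketch below only illustrates that separation step with BeautifulSoup; it is not the authors' code.

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

def split_page(url):
    """Split a web page into text (for braille/TTS) and images (for the
    tactile display), mirroring the browser's separation step."""
    soup = BeautifulSoup(urlopen(url).read(), "html.parser")
    text = soup.get_text(separator=" ", strip=True)                # -> braille / speech
    image_urls = [img.get("src") for img in soup.find_all("img")]  # -> pin array
    return text, image_urls
```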

Dialogic Male Voice Triphone DB Construction (남성 음성 triphone DB 구축에 관한 연구)

  • Kim, Yu-Jin;Baek, Sang-Hoon;Han, Min-Soo;Chung, Jae-Ho
    • The Journal of the Acoustical Society of Korea / v.15 no.2 / pp.61-71 / 1996
  • In this paper, the construction of a dialogic triphone database for a triphone synthesis system is discussed. In this work, dialogic speech data were collected from broadcast media and passed through three different transcription steps. A total of 10 hours of speech data were collected; six hours were used for the triphone database construction, and the remaining four hours were reserved. Constructing a dialogic speech database is very different from constructing a database of recited speech. This paper describes the various steps necessary for dialogic triphone database construction, from collecting the speech data to labeling the triphone units.
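
Triphone labeling assigns each phone a label that records its left and right neighbors. The sketch below uses the common HTK-style l-phone+r notation with silence padding; the paper's own labeling conventions for Korean may differ.

```python
def to_triphones(phones, pad="sil"):
    """Convert a phone sequence into context-dependent triphone labels."""
    p = [pad] + list(phones) + [pad]
    return [f"{p[i-1]}-{p[i]}+{p[i+1]}" for i in range(1, len(p) - 1)]

# Example: to_triphones(["k", "a", "m"]) -> ['sil-k+a', 'k-a+m', 'a-m+sil']
```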

Synthesis of Expressive Talking Heads from Speech with Recurrent Neural Network (RNN을 이용한 Expressive Talking Head from Speech의 합성)

  • Sakurai, Ryuhei;Shimba, Taiki;Yamazoe, Hirotake;Lee, Joo-Ho
    • The Journal of Korea Robotics Society / v.13 no.1 / pp.16-25 / 2018
  • A talking head (TH) is an utterance face animation generated from text and voice input. In this paper, we propose a method of generating a TH with facial expression and intonation from speech input alone. The problem of generating a TH from speech can be regarded as a regression problem from the acoustic feature sequence to the facial code sequence, a low-dimensional vector representation that can efficiently encode and decode a face image. This regression was modeled by a bidirectional RNN and trained on the SAVEE database of frontal utterance face animations. The proposed method generates a TH with facial expression and intonation from acoustic features such as MFCCs, the dynamic elements of the MFCCs, energy, and F0. According to the experiments, a configuration with BLSTM layers as the first and second layers of the bidirectional RNN predicted the face codes best. For evaluation, a questionnaire survey was conducted with 62 people who watched TH animations generated by the proposed method and by a previous method. As a result, 77% of the respondents answered that the TH generated by the proposed method matched the speech well.
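
The regression from acoustic features to face codes can be sketched as a two-layer BLSTM followed by a linear projection, matching the configuration the paper found best. The PyTorch sketch below uses illustrative feature and face-code dimensions; the original work's exact sizes are assumptions here.

```python
import torch
import torch.nn as nn

class AcousticToFaceCode(nn.Module):
    """Sketch of the paper's regression: acoustic features (MFCC + deltas,
    energy, F0) -> low-dimensional face codes, via a bidirectional RNN whose
    first two layers are BLSTMs. Dimensions are illustrative assumptions."""
    def __init__(self, n_acoustic=40, n_face_code=30, hidden=256):
        super().__init__()
        self.blstm = nn.LSTM(n_acoustic, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_face_code)

    def forward(self, x):          # x: (batch, time, n_acoustic)
        h, _ = self.blstm(x)       # (batch, time, 2*hidden)
        return self.out(h)         # (batch, time, n_face_code)
```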

Speech Animation Synthesis based on a Korean Co-articulation Model (한국어 동시조음 모델에 기반한 스피치 애니메이션 생성)

  • Jang, Minjung;Jung, Sunjin;Noh, Junyong
    • Journal of the Korea Computer Graphics Society / v.26 no.3 / pp.49-59 / 2020
  • In this paper, we propose a speech animation synthesis method specialized for Korean through a rule-based co-articulation model. Speech animation is widely used in cultural industries such as movies, animations, and games that require natural and realistic motion. However, because audio-driven speech animation techniques have mainly been developed for English, the animation results for domestic content are often visually very unnatural: for example, the dubbing of a voice actor is played with no mouth motion at all, or with an unsynchronized loop of simple mouth shapes at best. Although language-independent speech animation models exist, they do not yet ensure the quality required for domestic content production. Therefore, we propose a natural speech animation synthesis method, driven by input audio and text, that reflects the linguistic characteristics of Korean. Reflecting the fact that vowels mostly determine the mouth shape in Korean, we define a co-articulation model that separates the lips and the tongue, which solves the previous problems of lip distortion and the occasional loss of phoneme characteristics. Our model also reflects differences in prosodic features for improved dynamics in speech animation. Through user studies, we verify that the proposed model can synthesize natural speech animation.
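
The core idea is a rule table in which the vowel drives the lip shape while the consonant drives the tongue, so the two channels can be animated separately. The toy sketch below illustrates that split; the shape labels and mappings are invented for illustration and are not the paper's actual rules.

```python
# Toy rule tables: labels and mappings are illustrative only.
VOWEL_TO_LIPS = {"ㅏ": "open", "ㅗ": "round", "ㅜ": "round", "ㅣ": "spread"}
ONSET_TO_TONGUE = {"ㄴ": "alveolar", "ㄱ": "velar", "ㅅ": "alveolar"}

def syllable_targets(onset, vowel):
    """Return separate lip and tongue animation targets for one syllable."""
    return {"lips": VOWEL_TO_LIPS.get(vowel, "neutral"),
            "tongue": ONSET_TO_TONGUE.get(onset, "rest")}

# Example: syllable_targets("ㄴ", "ㅏ") -> {'lips': 'open', 'tongue': 'alveolar'}
```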

The Effectiveness of Electroglottographic Parameters in Differential Diagnosis of Laryngeal Cancer (후두암 감별진단에 있어 성문전도(Electroglottograph) 파라미터의 유용성)

  • 송인무;고의경;전경명;권순복;김기련;전계록;김광년;정동근;조철우
    • Journal of the Korean Society of Laryngology, Phoniatrics and Logopedics / v.14 no.1 / pp.16-25 / 2003
  • Background and Objectives: Electroglottography (EGG) is a non-invasive method of monitoring vocal cord vibration by measuring the variation of physiological impedance across the vocal folds through the skin of the neck. It reveals the vocal fold contact area in particular and is widely used for basic laryngeal research and for voice analysis and synthesis. The purpose of this study is to investigate the effectiveness of EGG parameters in the differential diagnosis of laryngeal cancer. Materials and Methods: The author investigated 10 laryngeal cancer patients and 25 benign laryngeal disease patients who visited the Department of Otolaryngology, Pusan National University Hospital. The EGG equipment was devised in the author's department. Among the various EGG parameters, the closed quotient (CQ), speed quotient (SQ), speed index (SI), Jitter, Shimmer, and Fo were determined by an analysis program written in MATLAB 6.5® (MathWorks, Inc.). In order to differentiate the various laryngeal diseases from pathologic voice signals, the author applied the EGG parameters to a neural network with a multilayer perceptron structure. Results: The SQ, SI, Jitter, and Shimmer values, unlike those of CQ and Fo, showed remarkable differences between the benign and malignant laryngeal disease groups. With the artificial neural network, the rate of correctly identifying laryngeal cancer exceeded 80% for SQ, SI, Jitter, and Shimmer, but not for CQ and Fo. These results indicate that it is possible to discriminate between benign and malignant laryngeal diseases with EGG parameters and an artificial neural network. Conclusion: If additional EGG parameters that reveal the pathology of laryngeal diseases are developed and the current classification algorithm is improved, the discrimination of laryngeal cancer will become much more accurate.
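
The classification stage can be reproduced in outline with a small multilayer perceptron over the discriminative parameters (SQ, SI, Jitter, Shimmer). The scikit-learn sketch below uses random placeholder features and an assumed hidden-layer size; it shows the pipeline shape, not the paper's trained model.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder features: one row per patient (SQ, SI, Jitter, Shimmer);
# labels: 0 = benign (25 cases), 1 = malignant (10 cases), as in the study.
rng = np.random.default_rng(0)
X = rng.random((35, 4))
y = np.r_[np.zeros(25), np.ones(10)]

clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0))
clf.fit(X, y)
print(clf.predict(X[:3]))  # benign/malignant predictions for three patients
```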

A Comparative Study of the Speech Signal Parameters for the Consonants of Pyongyang and Seoul Dialects - Focused on "ㅅ/ㅆ" (평양 지역어와 서울 지역어의 자음에 대한 음성신호 파라미터들의 비교 연구 - "ㅅ/ㅆ"을 중심으로)

  • So, Shin-Ae;Lee, Kang-Hee;You, Kwang-Bock;Lim, Ha-Young
    • Asia-pacific Journal of Multimedia Services Convergent with Art, Humanities, and Sociology / v.8 no.6 / pp.927-937 / 2018
  • In this paper, a comparative study of the consonants of the Pyongyang and Seoul dialects of Korean is performed from the perspective of signal processing, which can be regarded as the basis of engineering applications. Until now, most speech signal studies have focused primarily on vowels, which play an important role in language evolution. In any language, however, there are more consonants than vowels, so research on consonants is also important. Building on vowel studies of the Pyongyang dialect conducted with phonological and experimental phonetic methods, this consonant study proceeds on an engineering basis. The alveolar consonants, which show many differences in phonetic value between the Pyongyang and Seoul dialects, were used as the experimental data. The major parameters of speech signal analysis (formant frequency, pitch, and spectrogram) were measured, and the phonetic values of the two dialects were compared with respect to the Korean /시/ and /씨/. This study can serve as a basis for future voice recognition and voice synthesis.
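
The three measurements (formant frequencies, pitch, and spectrogram) can all be obtained with standard signal-processing tools. A librosa-based sketch follows; the pitch search range and the LPC-order heuristic for formant estimation are common defaults assumed here, not the paper's stated settings.

```python
import numpy as np
import librosa

def measure(y, sr):
    """Pitch track, magnitude spectrogram, and LPC-based formant estimates."""
    f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)  # pitch (F0) contour
    spec = np.abs(librosa.stft(y))                        # linear spectrogram
    a = librosa.lpc(y, order=2 + sr // 1000)              # all-pole vocal-tract fit
    roots = [r for r in np.roots(a) if np.imag(r) >= 0]
    freqs = sorted(np.angle(roots) * sr / (2 * np.pi))
    formants = [f for f in freqs if f > 90][:3]           # rough F1-F3
    return f0, spec, formants
```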

A Study on Activation of Online Performances Using Sac on Screen Project Analysis (Sac on Screen 사업 분석을 통한 온라인 공연 활성화 방안 연구)

  • Kim, Gyu-Jin;Na, Yun-Bin
    • The Journal of the Korea Contents Association / v.20 no.8 / pp.114-127 / 2020
  • The online performance market has been growing due to the recent pandemic. However, because online performances were introduced in Korea only recently, there is a lack of related prior studies and success stories. In addition, most of these projects are short-lived or barely profitable, so it is necessary to study how to activate them. The Sac on Screen project, which has been running for ten years, has accumulated its own filming experience, and its screened works and venues are diverse, which makes it a suitable object of study. Moreover, since annual satisfaction surveys are conducted, the project could be evaluated from the voice of the customer using data from the past three years. Based on the analysis, free and paid versions of a business model canvas were drawn up by a group of experts. From this synthesis, the following major implications were derived: first, expanding research on online performances; second, taking responsibility for the quality management of content; third, increasing diversity in content selection; fourth, enhancing the liveliness of online performances; and fifth, making efforts to attract private investment and develop value-added products.