• Title/Summary/Keyword: TTS

One-shot multi-speaker text-to-speech using RawNet3 speaker representation (RawNet3를 통해 추출한 화자 특성 기반 원샷 다화자 음성합성 시스템)

  • Sohee Han;Jisub Um;Hoirin Kim
    • Phonetics and Speech Sciences / v.16 no.1 / pp.67-76 / 2024
  • Recent advances in text-to-speech (TTS) technology have significantly improved the quality of synthesized speech, to the point where it can closely imitate natural human speech. In particular, TTS models offering various voice characteristics and personalized speech are widely used in fields such as artificial intelligence (AI) tutors, advertising, and video dubbing. Accordingly, in this paper, we propose a one-shot multi-speaker TTS system that can ensure acoustic diversity and synthesize personalized voices by generating speech from the utterances of unseen target speakers. The proposed model integrates a speaker encoder into a TTS model consisting of the FastSpeech2 acoustic model and the HiFi-GAN vocoder. The speaker encoder, based on the pre-trained RawNet3, extracts speaker-specific voice features. The proposed approach covers not only an English one-shot multi-speaker TTS but also a Korean one. We evaluate the naturalness and speaker similarity of the generated speech using objective and subjective metrics. In the subjective evaluation, the proposed Korean one-shot multi-speaker TTS obtained a naturalness mean opinion score (NMOS) of 3.36 and a similarity MOS (SMOS) of 3.16. The objective evaluation of the proposed English and Korean one-shot multi-speaker TTS showed prediction MOS (P-MOS) values of 2.54 and 3.74, respectively. These results indicate that the proposed model improves on the baseline models in both naturalness and speaker similarity.

Real-time Implementation of a 8 channel TTS Using a TMS320C6201 DSP (TMS320C6201 DSP를 이용한 8 채널 실시간 TTS 구현)

  • 최준용;박익현;박권원;안진형
    • Proceedings of the IEEK Conference / 2000.09a / pp.497-500 / 2000
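  • In this paper, we implemented a TTS algorithm as a multi-channel real-time system on the TMS320C6201, a 16-bit fixed-point DSP, and deployed it in board form in an actual speech-processing value-added service system. Through optimization, the implemented TTS generates 2 seconds of synthesized speech per channel at a clock of at most 40 MHz, and the developed TTS board uses two DSPs, each handling 8 channels, for a total of 16 channels. Experiments confirmed that speech synthesis runs in real time on all channels.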

Characteristics of directly sputtered Al cathode film using twin target sputtering system for OLEDs

  • Moon, Jong-Min;Lee, Sang-Hyeon;Kim, Han-Ki
    • 한국정보디스플레이학회:학술대회논문집 / 2007.08a / pp.655-658 / 2007
  • Characteristics of Al cathode films deposited using a specially designed twin target sputter (TTS) system were investigated. Al cathode films prepared by TTS had an amorphous structure with nanocrystallites owing to the low substrate temperature, and OLEDs fabricated using the TTS system had a low leakage current density at reverse bias because energetic particles were effectively confined during the sputtering process.

Performance Comparison of State-of-the-Art Vocoder Technology Based on Deep Learning in a Korean TTS System (한국어 TTS 시스템에서 딥러닝 기반 최첨단 보코더 기술 성능 비교)

  • Kwon, Chul Hong
    • The Journal of the Convergence on Culture Technology / v.6 no.2 / pp.509-514 / 2020
  • A conventional TTS system consists of several modules, including text preprocessing, parsing analysis, grapheme-to-phoneme conversion, boundary analysis, prosody control, acoustic feature generation by an acoustic model, and synthesized speech generation. A deep-learning-based TTS system, by contrast, is composed of a Text2Mel process that generates a spectrogram from text and a vocoder that synthesizes speech signals from the spectrogram. In this paper, to construct an optimal Korean TTS system, we apply Tacotron2 to the Text2Mel process, introduce WaveNet, WaveRNN, and WaveGlow as vocoders, and implement them to verify and compare their performance. Experimental results show that WaveNet has the highest MOS and a trained model hundreds of megabytes in size, but its synthesis time is about 50 times real time. WaveRNN shows MOS performance similar to that of WaveNet with a model size of several tens of megabytes, but it also cannot run in real time. WaveGlow can run in real time, but its model is several GB in size and its MOS is the worst of the three vocoders. Based on these results, this paper presents reference criteria for selecting an appropriate method according to the hardware environment in which the TTS system is deployed.
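
The "times real time" comparison in the abstract above is commonly expressed as a real-time factor: synthesis time divided by the duration of the audio produced. The following is a minimal sketch of that ratio, not the paper's measurement code; the function names and the timings are hypothetical.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Ratio of time spent synthesizing to the duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def is_real_time(synthesis_seconds: float, audio_seconds: float) -> bool:
    """A vocoder keeps up with playback when its real-time factor is at most 1."""
    return real_time_factor(synthesis_seconds, audio_seconds) <= 1.0


# Hypothetical timings (not from the paper): 10 s of audio that took 500 s
# to synthesize corresponds to a real-time factor of 50.
slow_rtf = real_time_factor(500.0, 10.0)
```

A factor above 1 means the vocoder cannot stream audio as fast as it is played back, which is the situation the abstract reports for WaveNet and WaveRNN.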

Differences in Temporary Threshold Shift and Recovery Patterns Depending on Sound Type and Pressure (소리의 종류와 크기에 따른 일과성 청력 역치 상승과 회복의 차이)

  • Lee, Chae Kwan
    • Journal of Korean Society of Occupational and Environmental Hygiene / v.30 no.4 / pp.387-393 / 2020
  • Objective: This study aimed to investigate the differences in temporary threshold shift (TTS) and recovery patterns according to sound type and volume. Methods: TTS and recovery patterns were assessed for eight students after 30-minute exposure to both 70.0 dB and 90.0 dB of factory noise (noise) as well as music. TTS was measured before exposure and two minutes post exposure, and recovery patterns were evaluated every 10 minutes for one hour. The subjects carried out their daily activities and slept as usual, but taking drugs and drinking alcohol were prohibited. The experiment was repeated three times with an interval of at least 16 hours. ANOVA and t-tests were carried out using SPSS 19.0 for Windows. Results: The hearing threshold of all subjects before exposure was less than 30 dB at all frequencies. Mean TTSs of 70 dB noise and 90 dB noise exposure were 0.14 and 4.48 dB (p<0.001). Meanwhile, the difference for music was insignificant (-0.63 dB and 0.55 dB, p=0.063). A difference was also found between the mean TTS of music and noise exposure, more clearly at 90.0 dB (p<0.001) than at 70 dB (p=0.232). Differences in TTS between sound types were also found across frequencies. Mean TTS by frequency was higher at 4,000 and 6,000 Hz than at other frequencies, and higher for noise than for music at the same sound pressure. The TTS difference at each frequency between the two sound types was significant at 90 dB (p<0.001). Subjects mostly recovered from TTS within one hour after exposure, but not after 90 dB noise exposure. Conclusion: TTS and recovery patterns differed depending on the sound type. When subjects were exposed to factory noise, TTS was greater and recovery time was longer than with music at the same sound pressure. These results suggest that differences in cognitive processes and psychological factors according to the type of sound cause changes in TTS and recovery.

A Study on the Sound Effect for Improving Customer's Speech Recognition in the TTS-based Shop Music Broadcasting Service (TTS를 이용한 매장음원방송에서 고객의 인지도 향상을 위한 음향효과 연구)

  • Kang, Sun-Mee;Kim, Hyun-Deuc;Chang, Moon-Soo
    • Phonetics and Speech Sciences / v.1 no.4 / pp.105-109 / 2009
  • This paper describes a method for delivering intelligible voice announcements using text-to-speech (TTS) technology in a shop music broadcasting service. Offering a high-quality TTS sound service for each shop is very expensive. According to a report on architectural acoustics, room acoustic indexes such as reverberation time and early decay time are closely connected with the subjective perception of acoustics. Building on this finding, applying a sound effect to speech files generated by TTS should help customers recognize voice announcements better. An aural comprehension test showed improvement in almost all parameters when a reverb effect was applied to the TTS sound.

Miniscalpel Needle Therapy with Integrative Korean Medical Treatment for Carpal Tunnel or Tarsal Tunnel Syndrome: Case Series of Three Patients

  • Kim, Jae Ik;Kim, Hye Su;Park, Gi Nam;Jeon, Ju Hyon;Kim, Jung Ho;Kim, Young Il
    • Journal of Acupuncture Research / v.34 no.3 / pp.139-152 / 2017
  • Objectives: This study reports the clinical effects of miniscalpel needle therapy in patients with carpal tunnel or tarsal tunnel syndrome. Methods: Three patients with carpal tunnel syndrome (CTS) or tarsal tunnel syndrome (TTS) (first case, patient with CTS and TTS; second case, patient with CTS; and third case, patient with TTS) were treated with miniscalpel needle (MSN) therapy and integrative Korean medical treatment. The Numeric Rating Scale (NRS), Neuropathic Pain Scale (NPS), Boston scale score, and AOFAS (American Orthopaedic Foot and Ankle Society) ankle-hindfoot score were measured. Results: In general, outcome measures after treatment showed improvement in all cases. In the first case (CTS and TTS), scores on the NRS, NPS, and Boston scale decreased, and AOFAS ankle-hindfoot scores increased. In addition, Tinel's sign showed improvement. In the second case (CTS), scores on the NRS, NPS, and Boston scale, as well as Tinel's sign, decreased. In the third case (TTS), scores on the NRS and NPS, and Tinel's sign, showed improvement, and AOFAS ankle-hindfoot scores increased. Conclusion: These results suggest that MSN therapy has a meaningful clinical effect in CTS and TTS.

Statistical analysis on the fluence factor of surveillance test data of Korean nuclear power plants

  • Lee, Gyeong-Geun;Kim, Min-Chul;Yoon, Ji-Hyun;Lee, Bong-Sang;Lim, Sangyeob;Kwon, Junhyun
    • Nuclear Engineering and Technology / v.49 no.4 / pp.760-768 / 2017
  • The transition temperature shift (TTS) of the reactor pressure vessel materials is an important factor that determines the lifetime of a nuclear power plant. The TTS at the end of a plant's lifespan is predicted using the equation in Regulatory Guide 1.99 revision 2 (RG1.99/2) from the US. The fluence factor in the equation is expressed as a power function, and the exponent value was determined from early US surveillance data. Recently, advanced approaches to estimating the TTS have been proposed in various countries for nuclear power plants, and Korea is considering the development of a new TTS model. In this study, the TTS trend of the Korean surveillance test results was analyzed using a nonlinear regression model and a mixed-effect model based on the power function. The nonlinear regression model yielded an exponent for the fluence power function similar to that of RG1.99/2. The mixed-effect model had a higher value of the exponent and showed superior goodness of fit compared with the nonlinear regression model. Compared with RG1.99/2 and RG1.99/3, the mixed-effect model provided a more accurate prediction of the TTS.
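
To illustrate what estimating the exponent of a power-function fluence factor involves, the sketch below fits shift = a · fluence^n by ordinary least squares in log-log space. This is a simplified stand-in, not the paper's nonlinear regression or mixed-effect model; the function name and the data points are hypothetical and generated from a known exponent purely for illustration.

```python
import math


def fit_power_law(fluence, shift):
    """Least-squares fit of shift = a * fluence**n, done in log-log space."""
    if len(fluence) != len(shift) or len(fluence) < 2:
        raise ValueError("need at least two paired observations")
    xs = [math.log(f) for f in fluence]
    ys = [math.log(s) for s in shift]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    n = sxy / sxx                        # slope in log space is the exponent
    a = math.exp(mean_y - n * mean_x)    # intercept in log space is log(a)
    return a, n


# Hypothetical, noise-free data built from a = 30 and n = 0.3
# (illustrative values only, not surveillance results).
fluence = [0.5, 1.0, 2.0, 4.0]
shift = [30.0 * f ** 0.3 for f in fluence]
a_hat, n_hat = fit_power_law(fluence, shift)
```

A mixed-effect model, as used in the study, would additionally let parameters vary by group (for example by plant or material), which this sketch does not attempt.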

An end-to-end synthesis method for Korean text-to-speech systems (한국어 text-to-speech(TTS) 시스템을 위한 엔드투엔드 합성 방식 연구)

  • Choi, Yeunju;Jung, Youngmoon;Kim, Younggwan;Suh, Youngjoo;Kim, Hoirin
    • Phonetics and Speech Sciences / v.10 no.1 / pp.39-48 / 2018
  • A typical statistical parametric speech synthesis (text-to-speech, TTS) system consists of separate modules, such as a text analysis module, an acoustic modeling module, and a speech synthesis module. This causes two problems: 1) expert knowledge of each module is required, and 2) errors generated in each module accumulate as they pass through subsequent modules. An end-to-end TTS system could avoid such problems by synthesizing voice signals directly from an input string. In this study, we implemented an end-to-end Korean TTS system using Google's Tacotron, which is an end-to-end TTS system based on a sequence-to-sequence model with an attention mechanism. We used 4392 utterances spoken by a Korean female speaker, an amount that corresponds to 37% of the dataset Google used for training Tacotron. Our system obtained a mean opinion score (MOS) of 2.98 and a degradation mean opinion score (DMOS) of 3.25. We discuss the factors that affected training of the system. Experiments demonstrate that the post-processing network needs to be designed with the output language and input characters in mind, and that, depending on the amount of training data, the maximum value of n for the n-grams modeled by the encoder should be small enough.
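
A mean opinion score such as the one reported above is the arithmetic mean of listener ratings. The sketch below assumes the common 1-to-5 rating scale, which the abstract does not state, and adds a normal-approximation confidence interval; the function names and ratings are hypothetical, not the paper's data.

```python
import math
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Arithmetic mean of listener ratings, assuming a 1-to-5 scale."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings are assumed to lie between 1 and 5")
    return mean(ratings)


def mos_confidence_interval(ratings, z=1.96):
    """Normal-approximation interval around the MOS (z=1.96 for about 95%)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    center = mean_opinion_score(ratings)
    half_width = z * stdev(ratings) / math.sqrt(len(ratings))
    return center - half_width, center + half_width


# Hypothetical ratings from five listeners for one synthesized utterance.
ratings = [3, 3, 4, 2, 3]
mos = mean_opinion_score(ratings)
low, high = mos_confidence_interval(ratings)
```

Reporting the interval alongside the mean helps when comparing systems whose scores are close, since small listener panels give wide intervals.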

A Study of Korean TTS Listening Speed for the Blind Using a Screen Reader (스크린리더를 사용하는 시각장애인의 한국어 합성음 청취속도 연구)

  • Lee, Heeyeon;Hong, Ki-Hyung
    • Phonetics and Speech Sciences / v.5 no.3 / pp.63-69 / 2013
  • The purpose of this study was to evaluate the maximum and optimal listening speed of Korean TTS for the blind. Five blind participants took part in this study. The instruments used in this study were 17 sentence sets (2 sets for an exercise, 10 sets for a repeated test, and 5 sets for a random test), with short meaningful sentences (the same sentences for the repeated test, different sentences for the random test) with 15 differentiated speeds (Range=0.8-3.6, SD=0.2). Each participant's maximum and quickest listening speeds were calculated by objective recall accuracy (determined by the number of correctly recalled syllables / the total number of syllables in a sentence × 100) and subjective recall accuracy (recall accuracy judged by each participant's subjective evaluation). The results showed that the participants' recall accuracy tended to increase as the TTS speed decreased. Participants' subjective recall accuracy was higher than objective recall accuracy in the repeated tests and vice versa in the random tests. The results also revealed that the participants' sentence familiarity had an influence on their Korean TTS listening speed.
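
The objective recall accuracy defined in the abstract above is a simple percentage, and the 15 speed levels are consistent with a range of 0.8 to 3.6 in steps of 0.2. The sketch below assumes that the reported 0.2 is the spacing between levels; the function names and the example counts are hypothetical.

```python
def recall_accuracy(correct_syllables: int, total_syllables: int) -> float:
    """Correctly recalled syllables / total syllables in the sentence x 100."""
    if total_syllables <= 0:
        raise ValueError("total_syllables must be positive")
    if not 0 <= correct_syllables <= total_syllables:
        raise ValueError("correct_syllables must be between 0 and the total")
    return correct_syllables / total_syllables * 100


def speed_levels(start: float = 0.8, stop: float = 3.6, step: float = 0.2):
    """Evenly spaced speed levels; the 0.2 spacing is an assumption."""
    count = round((stop - start) / step) + 1
    return [round(start + i * step, 1) for i in range(count)]


# Hypothetical example: 9 of 12 syllables recalled correctly.
accuracy = recall_accuracy(9, 12)
levels = speed_levels()
```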