Search | Korea Science

One-shot multi-speaker text-to-speech using RawNet3 speaker representation (RawNet3를 통해 추출한 화자 특성 기반 원샷 다화자 음성합성 시스템)

Sohee Han;Jisub Um;Hoirin Kim
- Phonetics and Speech Sciences / v.16 no.1 / pp.67-76 / 2024
Recent advances in text-to-speech (TTS) technology have significantly improved the quality of synthesized speech, to the point where it can closely imitate natural human speech. In particular, TTS models that offer varied voice characteristics and personalized speech are widely used in fields such as artificial intelligence (AI) tutors, advertising, and video dubbing. In this paper, we therefore propose a one-shot multi-speaker TTS system that ensures acoustic diversity and synthesizes personalized voices by generating speech from the utterances of unseen target speakers. The proposed model integrates a speaker encoder into a TTS model consisting of the FastSpeech2 acoustic model and the HiFi-GAN vocoder. The speaker encoder, based on the pre-trained RawNet3, extracts speaker-specific voice features. The proposed approach includes not only an English one-shot multi-speaker TTS but also a Korean one-shot multi-speaker TTS. We evaluate the naturalness and speaker similarity of the generated speech using objective and subjective metrics. In the subjective evaluation, the proposed Korean one-shot multi-speaker TTS obtained a naturalness mean opinion score (NMOS) of 3.36 and a similarity MOS (SMOS) of 3.16. The objective evaluation of the proposed English and Korean one-shot multi-speaker TTS showed a prediction MOS (P-MOS) of 2.54 and 3.74, respectively. These results indicate that the proposed model improves on the baseline models in terms of both naturalness and speaker similarity.
https://doi.org/10.13064/KSSS.2024.16.1.067

A Study of Subjective Speech Quality Measurement in VoIP (VoIP 음질의 주관적 평가에 관한 연구)

강영도;강진석;최연성;김장형
- Journal of the Korea Institute of Information and Communication Engineering / v.5 no.2 / pp.279-287 / 2001
In this paper, we discuss a scale for subjective speech quality measurement over a VoIP (Voice over IP) network, which is a component of broadband networks. Objective parameters of multimedia services, such as PSNR or jitter, can easily be measured and defined, but these factors do not readily correspond to the user's perception. We propose a speech quality measurement scale based on subjective measurement of end-to-end speech quality, which is composed of sender-side quality, transmission quality, and receiver-side quality. These represent, respectively, how correctly the speaker is represented, the degree of impairment caused by various factors, and how well the processed speech is recognized. We also examined the proposed method and verified its availability.

Acoustic Analysis of Normal and Vocal Pathologic Voice Using Dr. Speech Science (Dr. Speech Science를 이용한 정상 및 후두질환 환자의 음향분석)

Lee, Hyung-Seok;Tae, Kyung;Jang, Kyung-Jin;Kim, Kyung-Woo;Kim, Kyung-Rae;Park, Chul-Won
- Journal of the Korean Society of Laryngology, Phoniatrics and Logopedics / v.8 no.2 / pp.166-172 / 1997
Background: A number of objective methods, such as aerodynamic study, vibratory study, acoustic study, neuromuscular testing, and psychoacoustic evaluation, are now available for assessing pathologic voice change. They help to differentiate pathologic conditions from normal conditions and to monitor pathologic and aging changes. These laboratory analyses are commonly used to monitor speech therapy and to follow a patient's recovery after surgery. Objectives: We investigated the values of jitter, shimmer, and NNE in normal persons and patients with hoarseness in Korea. The values of jitter and shimmer might be meaningful parameters for distinguishing pathologic vibration from normal vibration and for assessing recovery after surgery. Materials and Methods: Normal controls and 48 subjects who underwent microlaryngeal surgery were compared for statistical significance using the Dr. Speech Science program, a computerized system for acoustic analysis of voice production that was employed to determine the vocal characteristics of pitch perturbation (jitter) and amplitude perturbation (shimmer). Results: The mean normal values of jitter and shimmer were 0.226 ± 0.110% and 2.200 ± 0.421% in males, and 0.164 ± 0.060% and 2.063 ± 0.575% in females. In patients with vocal nodules, the preoperative and postoperative values of jitter and shimmer were not meaningful. In patients with vocal polyps, the preoperative and postoperative values of jitter and shimmer were meaningful. Conclusion: The Dr. Speech Science program was effective for monitoring laryngeal disease and aging changes.

A study on combination of loss functions for effective mask-based speech enhancement in noisy environments (잡음 환경에 효과적인 마스크 기반 음성 향상을 위한 손실함수 조합에 관한 연구)

Jung, Jaehee;Kim, Wooil
- The Journal of the Acoustical Society of Korea / v.40 no.3 / pp.234-240 / 2021
In this paper, mask-based speech enhancement is improved for effective speech recognition in noisy environments. In mask-based speech enhancement, the enhanced spectrum is obtained by multiplying the noisy speech spectrum by the mask. The VoiceFilter (VF) model is used for mask estimation, and the Spectrogram Inpainting (SI) technique is used to remove residual noise from the enhanced spectrum. We propose a combined loss to further improve speech enhancement. To effectively remove the residual noise in the speech, the positive part of the Triplet loss is used together with the component loss. For the experiments, the TIMIT database is reconstructed using NOISEX92 noise and background music samples under various Signal to Noise Ratio (SNR) conditions. Source to Distortion Ratio (SDR), Perceptual Evaluation of Speech Quality (PESQ), and Short-Time Objective Intelligibility (STOI) are used as performance metrics. When the VF was trained with the mean squared error and the SI model was trained with the combined loss, SDR, PESQ, and STOI improved by 0.5, 0.06, and 0.002, respectively, compared to the system trained only with the mean squared error.
https://doi.org/10.7776/ASK.2021.40.3.234 KSCI
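
The masking step described in the abstract above (enhanced spectrum = noisy spectrum multiplied by the mask) can be illustrated with a minimal sketch. This is not the authors' VoiceFilter implementation: the function name `apply_mask`, the list-of-lists spectrogram layout, and the toy values are assumptions made only for illustration, and the mask here is given by hand rather than estimated by a model.

```python
def apply_mask(noisy_mag, mask):
    """Multiply a magnitude spectrogram by a mask, element by element.

    Both arguments are lists of frames; each frame is a list of
    per-frequency-bin values. The layout is a hypothetical choice.
    """
    if len(noisy_mag) != len(mask):
        raise ValueError("spectrogram and mask must have the same number of frames")
    enhanced = []
    for frame, mask_frame in zip(noisy_mag, mask):
        if len(frame) != len(mask_frame):
            raise ValueError("each frame and its mask must have the same number of bins")
        enhanced.append([m * x for m, x in zip(mask_frame, frame)])
    return enhanced


# Toy example: two frames, three frequency bins (values are made up).
noisy = [[4.0, 2.0, 1.0], [2.0, 4.0, 8.0]]
mask = [[1.0, 0.5, 0.0], [0.5, 0.25, 0.0]]
enhanced = apply_mask(noisy, mask)
print(enhanced)
```

A mask value of 1.0 keeps a bin unchanged and 0.0 suppresses it entirely; values in between attenuate the bin.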

Analysis of Subjective Sound Quality Characteristics for the HVAC using the Design of Experiments : Sharp, Annoy (실험계획법을 이용한 차량공조시스템의 음질 특성 분석)

Yun, Tae-Kun;Sim, Hyun-Jin;Lee, Jung-Youn;Oh, Jae-Eung;Kim, Sung-Soo
- Proceedings of the Korean Society for Noise and Vibration Engineering Conference / 2005.05a / pp.634-637 / 2005
Because human hearing is very sensitive and complex, a subjective index of sound quality is required. Sound quality evaluation is carried out for each situation, and a sound quality factor is composed accordingly. However, when the level is changed in only one frequency range of interest, the tendency across the whole frequency range cannot be seen. In this study, design of experiments is used: the frequency domain is divided into 12 equal bands, and sound quality is evaluated in terms of sharpness and annoyance by modifying each band. The design of experiments method greatly reduces the number of experiments. Through main-effect analysis of each band for sharpness and annoyance, it shows how a change in the band (an increase or decrease in sound pressure level, or no change) influences sound quality, so that the frequency bands that affect sound quality can be selected. From these results, the sensitivity to physical level changes in an arbitrary frequency band can be determined.

Nasometric and Acoustic Analysis in Experimentally Induced Velopharyngeal Insufficiency in Human (사람에서 유발시킨 구개인두부전증의 비음도와 음향학적 분석)

윤자복;성명훈;정원호;김광현
- Journal of the Korean Society of Laryngology, Phoniatrics and Logopedics / v.8 no.2 / pp.210-216 / 1997
Many tools have been used to evaluate the voice abnormalities of velopharyngeal insufficiency (VPI). The aim of this study was to obtain an objective evaluation method for VPI by comparing the acoustic and nasalance data of an experimentally induced VPI group with those of a normal control group. Ten healthy young men were included in this study. Mild and severe VPI were experimentally induced by retracting velopharyngeal movement. Using the nasometer, we obtained the nasalance scores of the sustained oral vowels and of three types of nasometer passages, and the slope scores of the nasogram of nasal words. We also analysed the change of formant frequencies for the sustained oral vowels and the changes of various parameters of hypernasality using the computerized speech analysis system. The nasalance score of sustained /a/ increased significantly in VPI conditions. There were no changes in the slope score of the nasogram. In the acoustic speech analysis, the second formant frequencies of the vowels /e/ and /i/ decreased significantly in VPI conditions. These results suggest that the measurement of nasalance score and formant frequency might be useful in the evaluation of VPI.

A Study of Automatic Evaluation Platform for Speech Recognition Engine in the Vehicle Environment (자동차 환경내의 음성인식 자동 평가 플랫폼 연구)

Lee, Seong-Jae;Kang, Sun-Mee
- The Journal of Korean Institute of Communications and Information Sciences / v.37 no.7C / pp.538-543 / 2012
The performance of the speech recognition engine is one of the most critical elements of the in-vehicle speech recognition interface. The objective of this paper is to develop an automated platform for running performance tests on the in-vehicle speech recognition engine. The developed platform comprises a main program, an agent program, a database management module, and a statistical analysis module. A simulation environment for performance tests that mimics real driving situations was constructed, and it was tested by applying pre-recorded driving noises and a speaker's voice as inputs. As a result, the validity of the results from the speech recognition tests was confirmed. Users will be able to run performance tests on the in-vehicle speech recognition engine effectively through the proposed platform.
https://doi.org/10.7840/KICS.2012.37.7C.538 KSCI

A study on deep neural speech enhancement in drone noise environment (드론 소음 환경에서 심층 신경망 기반 음성 향상 기법 적용에 관한 연구)

Kim, Jimin;Jung, Jaehee;Yeo, Chaneun;Kim, Wooil
- The Journal of the Acoustical Society of Korea / v.41 no.3 / pp.342-350 / 2022
In this paper, actual drone noise samples are collected for speech processing in disaster environments to build a noise-corrupted speech database, and speech enhancement performance is evaluated by applying spectral subtraction and mask-based speech enhancement techniques. To improve the performance of VoiceFilter (VF), an existing deep neural network-based speech enhancement model, we apply the Self-Attention operation and use the estimated noise information as input to the Attention model. Compared to the existing VF model, the experimental results show 3.77%, 1.66%, and 0.32% improvements in Source to Distortion Ratio (SDR), Perceptual Evaluation of Speech Quality (PESQ), and Short-Time Objective Intelligibility (STOI), respectively. When trained with a 75% mix of speech data with drone sounds collected from the Internet, the relative performance drop rates for SDR, PESQ, and STOI are 3.18%, 2.79%, and 0.96%, respectively, compared to using only actual drone noise. This confirms that data similar to real data can be collected and used effectively to train speech enhancement models in environments where real data is difficult to obtain.
https://doi.org/10.7776/ASK.2022.41.3.342 KSCI

Postoperative Change in Hypertrophic Rhinitis(Study Using Nasometer, CSL and Acoustic Rhinometer) (비후성 비염환자에서 음성검사 및 음향비강통기도검사를 이용한 수술전후 비교)

유영삼;우훈영;윤자복;최정환;조경래
- Journal of the Korean Society of Laryngology, Phoniatrics and Logopedics / v.12 no.1 / pp.34-38 / 2001
Background and Objectives: With the development of computerized systems, objective evaluation methods for nasal speech and nasal geometry have become readily available by means of simple, noninvasive techniques. In this study, we assessed the nasality, nasal formant, nasal volume, and nasal area in patients with hypertrophic rhinitis before and after turbinate surgery. Material and Method: With the nasometer, we measured nasalance, which reflects the ratio of acoustic energy output of nasal sounds from the nasal and oral cavities. With the CSL 4300B, we measured nasal formants. We used an acoustic rhinometer to measure nasal area and nasal volume. Postoperative changes in these factors were compared with preoperative values. The paired t-test and Pearson's correlation were used for statistical analysis. Results: The first nasal formant frequency, the nasalance scores of three passages (baby, mamma, and rabbit passages), the minimal cross-sectional area (MCA) of the narrow side, the nasal volume of the narrow side, and the nasal volume of the wide side increased significantly after turbinate surgery (p<0.05). The MCA and nasal volume of the narrow side and the MCA of the wide side showed significant correlation with the nasalance score of the rabbit passage, and the nasalance score of the baby passage showed significant correlation with the nasal volume of the narrow side (p<0.05). Conclusion: There were significant increases in nasalance scores, first nasal formant frequency, MCA, and nasal volume after turbinate surgery. Thus, we must consider the possibility of postoperative voice changes in professional voice users.
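
The abstract above describes nasalance only as a ratio of nasal to oral acoustic energy. A minimal sketch of the commonly used percentage form, nasal energy divided by the sum of nasal and oral energy, is given here; whether the nasometer in this study computed exactly this quantity is an assumption, and the function name `nasalance_percent` and the input values are hypothetical.

```python
def nasalance_percent(nasal_energy, oral_energy):
    """Return nasalance as a percentage: 100 * nasal / (nasal + oral).

    This is the commonly used definition, assumed here; the abstract
    itself does not state the exact formula.
    """
    if nasal_energy < 0 or oral_energy < 0:
        raise ValueError("energies must be non-negative")
    total = nasal_energy + oral_energy
    if total == 0:
        raise ValueError("total acoustic energy must be positive")
    return 100.0 * nasal_energy / total


# Made-up energies in arbitrary units, for illustration only.
score = nasalance_percent(1.0, 3.0)
print(score)
```

Under this definition the score runs from 0 (purely oral output) to 100 (purely nasal output), so an increase after surgery means a larger share of the energy is coming through the nose.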

Usefulness of the Vibration Pick-Up in Detection of Pitch for Synchronization of Laryngeal Stroboscopy (후두 스트로보스코프 검사의 신호 동기화를 위한 진동 검출기의 유용성)

Lee, Jin-Choon;Lee, Byung-Joo;Wang, Soo-Geun;Roh, Jung-Hoon;Kwon, Sun-Bok;Jo, Cheol-Woo
- Journal of the Korean Society of Laryngology, Phoniatrics and Logopedics / v.18 no.1 / pp.26-32 / 2007
Objective and Background: The laryngeal stroboscope is a useful instrument for evaluating vocal cord vibration and for early detection of mucosal lesions, including invasive cancer of the vocal cord. Recently, Lee et al. (2006) developed a portable stroboscope that uses the voice as the synchronization signal. However, its ability to synchronize the flashes was frequently impaired, even in normal females. The authors investigated various methods, including the vibration pick-up, microphone, laryngeal microphone, and contact microphone, to develop a method as simple and accurate as the electroglottograph (EGG) signal. The purpose of this study was to estimate whether the vibration pick-up is usable and consistent with the EGG signal. Subjects and Methods: The authors compared the signals of EGG with those of a noncontact method (voice) and of contact methods, including the vibration pick-up, laryngeal microphone, and contact microphone, in twenty normal adults (10 male and 10 female). The number of peaks in one cycle was compared with the number of peaks in EGG, and the percentage of phase difference in the peak was compared with EGG. The authors also investigated which site of the vibration pick-up was most effective for synchronization of strobo flashes. Three sites were analysed: the anterior neck below the cricoid cartilage, the thyroid ala, and the suprahyoid region. Results: Among the various methods for synchronization of strobo flashes, the vibration pick-up was the most effective method for peak detection, and the anterior neck below the cricoid cartilage was the most suitable site for the vibration pick-up. Conclusion: The authors suggest that the vibration pick-up is the most suitable and effective method for synchronization of strobo flashes.
