• Title/Summary/Keyword: Spectrogram

Search Results: 233

Acoustic features of diphthongs produced by children with speech sound disorders (말소리장애 아동이 산출한 이중모음의 음향학적 특성)

  • Cho, Yoon Soo; Pyo, Hwa Young; Han, Jin Soon; Lee, Eun Ju
    • Phonetics and Speech Sciences / v.13 no.1 / pp.65-72 / 2021
  • The aim of this study is to provide basic data for evaluation and intervention by investigating the acoustic characteristics of diphthongs produced by children with speech sound disorders. Two groups of 10 children each, with and without speech sound disorders, were asked to imitate meaningless two-syllable sequences of the form 'diphthong + da'. The slopes of F1 and F2, the amount of formant change, and the duration of the glide were analyzed with Praat (version 6.1.16). A between-group difference was found in the F1 slope of /ju/. Children with speech sound disorders showed smaller formant changes and shorter durations than typically developing children, and these differences were statistically significant: the amount of formant change in the glide differed in F1 of /ju, jɛ/ and F2 of /jɑ, jɛ/, and glide duration differed in /ju, jɛ/. These results indicate that the articulatory range of diphthongs in children with speech sound disorders is smaller than that of typically developing children, so the time taken to articulate them is also reduced. They suggest that articulatory range and acoustic analysis should be further investigated for the evaluation and intervention of diphthongs in children with speech sound disorders.
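To illustrate the kind of measurement this abstract describes, below is a minimal sketch (not the authors' script) using the praat-parselmouth Python bindings for Praat to read F1/F2 at two glide landmarks and compute the amount of formant change and the formant slope. The file name, the 25%/75% landmark choice, and the formant settings are assumptions for illustration only.

```python
# Minimal sketch: measure F1/F2 change and slope across a diphthong glide
# with praat-parselmouth. The landmarks (25% and 75% of the token) and all
# analysis settings are illustrative assumptions, not the authors' protocol.
import parselmouth

snd = parselmouth.Sound("ju_da.wav")  # hypothetical recording of 'ju + da'
formant = snd.to_formant_burg(time_step=0.005,
                              max_number_of_formants=5,
                              maximum_formant=5500.0)  # higher ceiling for children

t_on, t_off = 0.25 * snd.duration, 0.75 * snd.duration  # assumed glide landmarks
for fn, label in [(1, "F1"), (2, "F2")]:
    f_on = formant.get_value_at_time(fn, t_on)   # Hz at glide onset
    f_off = formant.get_value_at_time(fn, t_off) # Hz at glide offset
    change = f_off - f_on                        # amount of formant change (Hz)
    slope = change / (t_off - t_on)              # formant slope (Hz/s)
    print(f"{label}: change = {change:.0f} Hz, slope = {slope:.0f} Hz/s")
```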

A Multi-speaker Speech Synthesis System Using X-vector (x-vector를 이용한 다화자 음성합성 시스템)

  • Jo, Min Su; Kwon, Chul Hong
    • The Journal of the Convergence on Culture Technology / v.7 no.4 / pp.675-681 / 2021
  • With the recent growth of the AI speaker market, demand is increasing for speech synthesis technology that enables natural conversation with users, and hence for multi-speaker speech synthesis systems that can generate voices with various timbres. Synthesizing natural speech requires training on a large-capacity, high-quality speech DB, but collecting such a database uttered by many speakers is very costly in recording time and money. It is therefore necessary to train the synthesis system on a DB covering a very large number of speakers with only a small amount of data per speaker, which in turn requires a technique for naturally expressing the timbre and prosody of multiple speakers. In this paper, we propose a technique that builds a speaker encoder using the deep-learning-based x-vector method from speaker recognition, and synthesizes a new speaker's timbre from a small amount of data through this encoder. In the proposed multi-speaker system, the module that synthesizes a mel-spectrogram from input text is Tacotron2, and the vocoder that generates the synthesized speech is a WaveNet with a mixture-of-logistics output distribution. The x-vector extracted from the trained speaker-embedding network is added to Tacotron2 as an additional input to express the desired speaker's timbre.
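The abstract states that the x-vector is added to Tacotron2 as an input but does not spell out the mechanism. A common design, sketched below in PyTorch as an assumption rather than the paper's exact architecture, is to project the fixed speaker embedding and concatenate it to every frame of the text-encoder output so the decoder's attention sees both text content and speaker identity. All dimensions here are illustrative.

```python
# Minimal sketch (assumed design, not the paper's code): condition a
# Tacotron2-style decoder on a fixed x-vector by broadcasting it over time
# and concatenating it to the text-encoder outputs.
import torch
import torch.nn as nn

class SpeakerConditioning(nn.Module):
    def __init__(self, enc_dim=512, xvec_dim=512, proj_dim=64):
        super().__init__()
        # Project the x-vector to a small conditioning vector (sizes assumed).
        self.proj = nn.Linear(xvec_dim, proj_dim)

    def forward(self, encoder_outputs, xvector):
        # encoder_outputs: (batch, time, enc_dim); xvector: (batch, xvec_dim)
        cond = torch.tanh(self.proj(xvector))                 # (batch, proj_dim)
        cond = cond.unsqueeze(1).expand(-1, encoder_outputs.size(1), -1)
        # The decoder now attends over text features + speaker identity.
        return torch.cat([encoder_outputs, cond], dim=-1)     # (batch, time, enc_dim + proj_dim)

# Usage: enc = torch.randn(2, 120, 512); xv = torch.randn(2, 512)
# conditioned = SpeakerConditioning()(enc, xv)  # shape (2, 120, 576)
```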

Comparative study of data augmentation methods for fake audio detection (음성위조 탐지에 있어서 데이터 증강 기법의 성능에 관한 비교 연구)

  • KwanYeol Park; Il-Youp Kwak
    • The Korean Journal of Applied Statistics / v.36 no.2 / pp.101-114 / 2023
  • Data augmentation is an effective remedy for overfitting, since it lets the model see the training data from various perspectives. Beyond image augmentations such as rotation, cropping, and horizontal/vertical flips, occlusion-based methods such as CutMix and Cutout have been proposed. For models trained on speech, occlusion-based augmentation can be applied after converting the 1D speech signal into a 2D spectrogram; SpecAugment in particular is an occlusion-based technique designed for speech spectrograms. In this study, we compare data augmentation techniques for fake audio detection. Using data from the ASVspoof2017 and ASVspoof2019 challenges, datasets augmented with the occlusion-based methods Cutout, CutMix, and SpecAugment were used to train an LCNN model. All three techniques generally improved model performance: CutMix performed best on ASVspoof2017, Mixup on ASVspoof2019 LA, and SpecAugment on ASVspoof2019 PA. In addition, increasing the number of masks in SpecAugment helped to improve performance. In conclusion, the appropriate augmentation technique differs depending on the situation and the data.
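As a concrete illustration of the waveform-to-spectrogram-then-mask pipeline the abstract describes, the following is a minimal sketch using torchaudio's SpecAugment-style masking transforms. The mel-spectrogram settings and the number of masks are assumptions for illustration, not the paper's configuration.

```python
# Minimal sketch (illustrative settings, not the paper's configuration):
# convert a 1D waveform to a 2D mel-spectrogram, then apply SpecAugment-style
# frequency and time masking with torchaudio. Multiple masks are applied,
# echoing the finding that increasing the number of masks can help.
import torch
import torchaudio.transforms as T

wave = torch.randn(1, 16000)  # 1 s of synthetic 16 kHz audio for demonstration

to_mel = T.MelSpectrogram(sample_rate=16000, n_fft=512, hop_length=160, n_mels=64)
spec = to_mel(wave)  # (channel, n_mels, frames)

freq_mask = T.FrequencyMasking(freq_mask_param=8)  # mask up to 8 mel bins
time_mask = T.TimeMasking(time_mask_param=20)      # mask up to 20 frames

n_masks = 2  # assumed; the paper varies this number
for _ in range(n_masks):
    spec = freq_mask(spec)
    spec = time_mask(spec)
# `spec` is now an occluded spectrogram ready for LCNN training.
```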