• Title/Summary/Keyword: Speech Spectrogram


An Acoustic Analysis of Vowels for Severe-profound Hearing Impaired Children (최고도이상의 청력손실을 가진 아동의 모음음형대 분석)

  • Huh, Myung-Jin
    • Speech Sciences / v.14 no.2 / pp.65-71 / 2007
  • Children with severe-to-profound hearing impairment experience various difficulties in everyday communication due to the lack of auditory feedback. In particular, their speech exhibits unstable voice, omission and distortion of articulation, pitch breaks, cul-de-sac voice, and so on, which makes it difficult for them to deliver an intended message accurately. This study analyzes the acoustic characteristics of 4 vowel sounds produced by 35 children with severe-to-profound hearing impairment using the CSL (Computerized Speech Lab, Model 4300b). The formant data were obtained from the spectrogram and analyzed with a 12 formant filter and auto-correlation among the formants. Results showed that the hearing impaired children's formant values were very high. They produced the vowels in a hypertensive mode with an unstable voice. To improve their speech, they would need adequate auditory feedback.


Acoustic Characteristics of 'Short Rushes of Speech' using Alternate Motion Rates in Patients with Parkinson's Disease (파킨슨병 환자의 교대운동속도 과제에서 관찰된 '말 뭉침'의 음향학적 특성)

  • Kim, Sun Woo;Yoon, Ji Hye;Lee, Seung Jin
    • Phonetics and Speech Sciences / v.7 no.2 / pp.55-62 / 2015
  • It is widely accepted that Parkinson's disease (PD) is the most common cause of hypokinetic dysarthria, and its characteristic 'short rushes of speech' become more evident as the motor disorder grows more severe. Speech alternate motion rates (AMRs) are particularly useful for observing not only rate abnormalities but also deviant speech. However, apart from perceptual characteristics, relatively little is known about 'short rushes of speech' in the AMRs of patients with PD. The purpose of this study was to examine which acoustic features of 'short rushes of speech' in AMRs are robust indicators of Parkinsonian speech. Syllabic repetitions (/pə/, /tə/, /kə/) in AMR tasks from 9 patients with PD were analyzed acoustically by observing spectrograms from the Computerized Speech Lab. Acoustically, we found three characteristics of 'short rushes of speech': 1) vocalized consonants without closure duration (VC), 76.3%; 2) no consonant segmentation (NC), 18.6%; 3) no vowel formant frequency (NV), 5.1%. Based on these results, 'short rushes of speech' may be associated with a failure to reach and maintain the phonatory targets. To best achieve the therapeutic goals and make treatment most efficacious, it is important to incorporate training methods based on both phonation and articulation.

A Study on Speaker Identification Parameter Using Difference and Correlation Coefficient of Digit_sound Spectrum (숫자음의 스펙트럼 차이값과 상관계수를 이용한 화자인증 파라미터 연구)

  • Lee, Hoo-Dong;Kang, Sun-Mee;Chang, Moon-Soo;Yang, Byung-Gon
    • Speech Sciences / v.11 no.3 / pp.131-142 / 2004
  • A speaker identification system basically functions by comparing the spectral energy of an individual production model with that of an input signal. This study aimed to develop a new speaker identification system based on two parameters derived from the spectral energy of numeric sounds: the difference sum and the correlation coefficient. A narrow-band spectrogram yielded more stable spectral energy across time than a wide-band one. In this paper, we collected empirical data from four male speakers and tested the speaker identification system. The subjects produced 18 combinations of three-digit numeric sounds ten times each. Five productions of each three-digit number were statistically averaged to make a model for each speaker. Then, the remaining five productions were tested on the system. Results showed that when the threshold for the absolute difference sum was set to 1200, none of the speakers could pass the system, while everybody could pass when it was set to 2800. The minimum correlation coefficient that allowed all to pass was 0.82, while a coefficient of 0.95 rejected all. Thus, both threshold levels can be adjusted to the needs of a speaker identification system, which is desirable for further study.
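
The two parameters described in this abstract can be illustrated with a minimal sketch. It assumes each production is already reduced to a list of spectral energy values of equal length, and that a sample passes only when both tests succeed; the function names, the way the two tests are combined, and the default thresholds (hypothetical midpoints between the reported reject-all and accept-all values) are illustrative assumptions, not the authors' implementation.

```python
import math

def average_model(productions):
    """Average several productions (equal-length energy lists) into one model."""
    return [sum(column) / len(column) for column in zip(*productions)]

def difference_sum(model, sample):
    """Sum of absolute differences between model and sample energies."""
    return sum(abs(m - s) for m, s in zip(model, sample))

def correlation(model, sample):
    """Pearson correlation coefficient between model and sample energies."""
    n = len(model)
    mean_m = sum(model) / n
    mean_s = sum(sample) / n
    cov = sum((m - mean_m) * (s - mean_s) for m, s in zip(model, sample))
    var_m = sum((m - mean_m) ** 2 for m in model)
    var_s = sum((s - mean_s) ** 2 for s in sample)
    return cov / math.sqrt(var_m * var_s)

def accept(model, sample, max_diff_sum=2000.0, min_corr=0.90):
    """Pass only if both tests succeed (combining them with AND is an assumption).

    The defaults are hypothetical values between the thresholds reported in
    the abstract (1200 to 2800 for the sum, 0.82 to 0.95 for the correlation).
    """
    return (difference_sum(model, sample) <= max_diff_sum
            and correlation(model, sample) >= min_corr)

# Toy data: two "training" productions and two test samples.
model = average_model([[10.0, 20.0, 30.0], [12.0, 22.0, 32.0]])
same_shape = [11.0, 21.0, 31.0]
reversed_shape = [31.0, 21.0, 11.0]
```

Raising `max_diff_sum` or lowering `min_corr` makes the check more permissive, which mirrors the adjustable thresholds the abstract describes.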


Estimation of Fundamental Frequency Using an Instantaneous Frequency Based on the Symmetric Higher Order Differential Energy Operator (대칭구조를 갖는 일반적인 고차의 미분 에너지함수를 기반한 순간주파수를 이용한 음성의 기본주파수 추정)

  • Iem, Byeong-Gwan
    • The Transactions of The Korean Institute of Electrical Engineers / v.60 no.12 / pp.2374-2379 / 2011
  • The fundamental frequency of voiced speech is estimated using the instantaneous frequency based on the symmetric higher order differential energy operator. The instantaneous frequency based on the symmetric higher order energy operator gives better frequency estimates because it is aligned with the time instant of the signal. The speech is pre-processed by a lowpass filter to remove higher frequency components. Then, the instantaneous frequency of the filtered signal is computed to obtain the fundamental frequency estimates. The symmetric higher order energy operator is also used as an indicator to distinguish voiced from unvoiced speech. The fundamental frequency estimates are further processed by a moving average filter to obtain smoothly varying estimates. The resulting fundamental frequency estimates were compared with the spectrogram of the speech to confirm their accuracy.
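
The pipeline in this abstract (lowpass filter, energy-operator instantaneous frequency, moving-average smoothing) can be sketched as follows. This sketch does not implement the paper's symmetric higher order operator; it swaps in the standard Teager-Kaiser energy operator with the DESA-2 frequency formula, and it uses a plain moving average as a crude lowpass filter. All function names and window widths are assumptions chosen for illustration.

```python
import math

def teager(x):
    """Teager-Kaiser energy psi[n] = x[n]^2 - x[n-1]*x[n+1] for interior samples."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

def moving_average(x, width):
    """Centered moving average; the window shrinks near the edges."""
    half = width // 2
    out = []
    for n in range(len(x)):
        window = x[max(0, n - half): n + half + 1]
        out.append(sum(window) / len(window))
    return out

def desa2_frequency(x, fs):
    """Per-sample instantaneous frequency in Hz using DESA-2."""
    y = [x[n + 1] - x[n - 1] for n in range(1, len(x) - 1)]  # symmetric difference
    psi_x = teager(x)[1:-1]  # trimmed so it lines up with psi_y
    psi_y = teager(y)
    freqs = []
    for px, py in zip(psi_x, psi_y):
        if px <= 1e-12:
            freqs.append(0.0)  # negligible energy: treat as unvoiced
            continue
        arg = max(-1.0, min(1.0, 1.0 - py / (2.0 * px)))
        freqs.append(0.5 * math.acos(arg) * fs / (2.0 * math.pi))
    return freqs

def estimate_f0(x, fs, lowpass_width=9, smooth_width=31):
    """Lowpass, estimate instantaneous frequency, then smooth the track."""
    filtered = moving_average(x, lowpass_width)  # crude lowpass (assumption)
    return moving_average(desa2_frequency(filtered, fs), smooth_width)

# Synthetic check: a pure 120 Hz tone sampled at 8 kHz.
fs = 8000
tone = [math.sin(2.0 * math.pi * 120.0 * n / fs) for n in range(800)]
f0_track = estimate_f0(tone, fs)
```

On real speech the energy threshold and filter design would need tuning, and several harmonics in the passband would bias a single-component estimator such as DESA-2.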

A Preliminary Study on Differences of Phonatory Offset-Onset between the Fluency and a Dysfluency (유창성과 비유창성 화자의 발성 종결-개시 차이에 관한 예비연구)

  • Han Ji-Yeon;Lee Ok-Bun
    • Proceedings of the KSPS conference / 2006.05a / pp.109-112 / 2006
  • This study investigated the acoustic characteristics of phonatory offset-onset mechanisms and compared non-stutterers (N=3) with a stutterer (N=1). Phonatory offset-onset refers to laryngeal articulation in connected speech. In the phonetic context (V_V), pattern 0 (no change) appeared in all subjects, while pattern 4 (traces of glottal fry and closure in the spectrogram) appeared only in the stutterer. In high vowels (/i/, /u/), patterns 3 and 4 appeared only in the stutterer. Although there was no common pattern among the non-stutterers, each individual showed a preferred pattern. This study offers a key to understanding the physiological movement involved in a stuttering block.


Speech Intelligibility and Sonagraphic Evaluation of Experimental Model of Obturator-type Electrolarynx (시험적 의치형 전기후두의 어음명료도 및 소나그라프 검사)

  • 김기령;홍원표;김광문;심윤주;이승철;김경수;이문재
    • Journal of the Korean Society of Laryngology, Phoniatrics and Logopedics / v.3 no.1 / pp.6-12 / 1989
  • Methods of voice rehabilitation in laryngectomees include training in esophageal speech, use of an electrolarynx or pneumatic speech aid, and surgical methods. In this paper, we introduce an experimental model of an obturator-type electrolarynx, which has several advantages in use, such as ease of learning, an unobtrusive appearance, and leaving both hands free. We compared it with normal voice and with other voice rehabilitation methods, namely esophageal voice, a Japanese pneumatic speech aid, and a cervical electrolarynx, in terms of intelligibility and sonagraphic evaluation. The results are as follows: 1) The obturator-type electrolarynx exhibited the lowest intelligibility. 2) In sonagraphic evaluation, the spectrogram produced by the obturator-type electrolarynx differed the most from that of normal voice.


A study on combination of loss functions for effective mask-based speech enhancement in noisy environments (잡음 환경에 효과적인 마스크 기반 음성 향상을 위한 손실함수 조합에 관한 연구)

  • Jung, Jaehee;Kim, Wooil
    • The Journal of the Acoustical Society of Korea / v.40 no.3 / pp.234-240 / 2021
  • In this paper, mask-based speech enhancement is improved for effective speech recognition in noisy environments. In mask-based speech enhancement, the enhanced spectrum is obtained by multiplying the noisy speech spectrum by the mask. The VoiceFilter (VF) model is used for mask estimation, and the Spectrogram Inpainting (SI) technique is used to remove residual noise from the enhanced spectrum. We propose a combined loss to further improve speech enhancement. To effectively remove the residual noise in the speech, the positive part of the triplet loss is used together with the component loss. For the experiments, the TIMIT database is reconstructed using NOISEX92 noise and background music samples under various Signal to Noise Ratio (SNR) conditions. Source to Distortion Ratio (SDR), Perceptual Evaluation of Speech Quality (PESQ), and Short-Time Objective Intelligibility (STOI) are used as performance metrics. When the VF was trained with the mean squared error and the SI model was trained with the combined loss, SDR, PESQ, and STOI improved by 0.5, 0.06, and 0.002, respectively, compared to the system trained only with the mean squared error.
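
The masking step and the mean squared error baseline mentioned in this abstract can be shown with a small sketch. It assumes magnitude spectrograms stored as nested lists (frames by frequency bins) and a mask with values in [0, 1]; the function names and the toy numbers are hypothetical, and the paper's VoiceFilter estimator, inpainting model, and combined loss are not reproduced here.

```python
def apply_mask(noisy_spec, mask):
    """Enhanced spectrum = noisy spectrum multiplied element-wise by the mask."""
    return [[y * m for y, m in zip(frame, mask_frame)]
            for frame, mask_frame in zip(noisy_spec, mask)]

def mean_squared_error(estimate, target):
    """Mean squared error over all time-frequency bins."""
    total = 0.0
    count = 0
    for est_frame, tgt_frame in zip(estimate, target):
        for e, t in zip(est_frame, tgt_frame):
            total += (e - t) ** 2
            count += 1
    return total / count

# Toy example: one frame, two frequency bins (values are made up).
noisy = [[2.0, 4.0]]
mask = [[0.5, 0.25]]
clean = [[1.0, 3.0]]
enhanced = apply_mask(noisy, mask)
loss = mean_squared_error(enhanced, clean)
```

In a trained system the mask would come from a network and the loss would be minimized over many utterances; here the mask is fixed only to make the arithmetic visible.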

Comparison of Korean Real-time Text-to-Speech Technology Based on Deep Learning (딥러닝 기반 한국어 실시간 TTS 기술 비교)

  • Kwon, Chul Hong
    • The Journal of the Convergence on Culture Technology / v.7 no.1 / pp.640-645 / 2021
  • A deep learning based end-to-end TTS system consists of a Text2Mel module that generates a spectrogram from text and a vocoder module that synthesizes speech signals from the spectrogram. Recently, by applying deep learning technology to TTS systems, the intelligibility and naturalness of synthesized speech have improved to a level comparable to human speech. However, such systems have the disadvantage that the inference speed for synthesizing speech is very slow compared to conventional methods. The inference speed can be improved by applying a non-autoregressive method, which can generate speech samples in parallel, independent of previously generated samples. In this paper, we introduce FastSpeech, FastSpeech 2, and FastPitch as Text2Mel technologies, and Parallel WaveGAN, Multi-band MelGAN, and WaveGlow as vocoder technologies that apply the non-autoregressive method. We implement them to verify whether they can run in real time. Experimental results show that, according to the measured RTF, all the presented methods are sufficiently capable of real-time processing. In addition, the size of the trained models is about tens to hundreds of megabytes, except for WaveGlow, so they can be applied to embedded environments where memory is limited.
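
The RTF (real-time factor) used in this abstract is conventionally the synthesis time divided by the duration of the generated audio, so a value below 1 means faster than real time. The sketch below measures it for any synthesizer exposed as a Python callable; `synthesize` and `dummy_synthesizer` are hypothetical placeholders, not part of any of the systems named above.

```python
import time

def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds

def measure_rtf(synthesize, text, sample_rate):
    """Time one call to `synthesize(text)`, which must return a sequence of samples."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)

def dummy_synthesizer(text):
    """Stand-in for a real TTS model: returns one second of silence at 22.05 kHz."""
    return [0.0] * 22050

rtf = measure_rtf(dummy_synthesizer, "hello", sample_rate=22050)
```

For a fair comparison between models, the measurement would normally be averaged over many sentences on the same hardware.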

A Novel Approach to a Robust A Priori SNR Estimator in Speech Enhancement (음성 향상에서 강인한 새로운 선행 SNR 추정 기법에 관한 연구)

  • Park, Yun-Sik;Chang, Joon-Hyuk
    • The Journal of the Acoustical Society of Korea / v.25 no.8 / pp.383-388 / 2006
  • This paper presents a novel approach to single-channel microphone speech enhancement in noisy environments. Widely used noise reduction techniques based on spectral subtraction are generally expressed as a spectral gain that depends on the signal-to-noise ratio (SNR). The well-known decision-directed (DD) estimator of Ephraim and Malah efficiently reduces musical noise under background noise conditions, but it delays the a priori SNR because the DD weights the speech spectrum component of the previous frame. As a result, the noise suppression gain, which depends on the a priori SNR estimated by the DD, matches the previous frame rather than the current one, and this degrades the noise reduction performance during speech transient periods. We propose a computationally simple but effective speech enhancement technique based on a sigmoid-type function for the weight parameter of the DD. The proposed approach solves the delay problem in the main parameter, the a priori SNR of the DD, while maintaining the benefits of the DD. The performance of the proposed enhancement algorithm is evaluated by ITU-T P.862 Perceptual Evaluation of Speech Quality (PESQ), the Mean Opinion Score (MOS), and the speech spectrogram under various noise environments, and it yields better results than the fixed weight parameter of the DD.
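
The delay described in this abstract can be reproduced with a short sketch of the decision-directed recursion for one frequency bin. The fixed-weight update follows the standard Ephraim and Malah form; the sigmoid weight is a hypothetical stand-in, since the abstract does not give the paper's actual function or its argument. The Wiener gain, the SNR floor, and all parameter values are likewise assumptions.

```python
import math

def fixed_weight(gamma, prev_gamma, alpha=0.98):
    """Conventional DD: constant weight on the previous frame."""
    return alpha

def sigmoid_weight(gamma, prev_gamma, alpha_max=0.98, slope=0.5, center=10.0):
    """Hypothetical sigmoid weight: falls toward 0 when the a posteriori SNR jumps."""
    z = min(slope * (abs(gamma - prev_gamma) - center), 50.0)  # clamp avoids overflow
    return alpha_max / (1.0 + math.exp(z))

def a_priori_snr(noisy_power, noise_power, weight_fn, xi_min=10 ** (-25 / 10)):
    """Track the a priori SNR over frames for a single frequency bin."""
    track = []
    prev_clean_power = 0.0
    prev_gamma = noisy_power[0] / noise_power
    for y2 in noisy_power:
        gamma = y2 / noise_power                    # a posteriori SNR
        alpha = weight_fn(gamma, prev_gamma)
        xi = (alpha * prev_clean_power / noise_power
              + (1.0 - alpha) * max(gamma - 1.0, 0.0))
        xi = max(xi, xi_min)                        # floor limits musical noise
        gain = xi / (1.0 + xi)                      # Wiener gain (assumption)
        prev_clean_power = (gain ** 2) * y2         # estimated clean speech power
        prev_gamma = gamma
        track.append(xi)
    return track

# Speech onset at frame 5: power jumps from the noise level to 20 dB above it.
frames = [1.0] * 5 + [100.0] * 5
xi_fixed = a_priori_snr(frames, 1.0, fixed_weight)
xi_adaptive = a_priori_snr(frames, 1.0, sigmoid_weight)
```

At the onset frame the fixed weight yields an a priori SNR near 2, while the adaptive weight follows the jump immediately, which illustrates the one-frame lag the abstract refers to.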

AM-FM Decomposition and Estimation of Instantaneous Frequency and Instantaneous Amplitude of Speech Signals for Natural Human-robot Interaction (자연스런 인간-로봇 상호작용을 위한 음성 신호의 AM-FM 성분 분해 및 순간 주파수와 순간 진폭의 추정에 관한 연구)

  • Lee, He-Young
    • Speech Sciences / v.12 no.4 / pp.53-70 / 2005
  • Vowels in speech signals are multicomponent signals composed of AM-FM components whose instantaneous frequency and instantaneous amplitude are time-varying. Changes in emotional state cause variation in the instantaneous frequencies and instantaneous amplitudes of the AM-FM components. Therefore, accurately estimating the instantaneous frequencies and instantaneous amplitudes of the AM-FM components is important for extracting key information that represents emotional states and changes in speech signals. In this paper, firstly, a method for decomposing speech signals into AM-FM components is addressed. Secondly, the fundamental frequency of a vowel sound is estimated by a simple method based on the spectrogram. The estimate of the fundamental frequency is used for decomposing speech signals into AM-FM components. Thirdly, an estimation method is suggested for separating the instantaneous frequencies and the instantaneous amplitudes of the decomposed AM-FM components, based on the Hilbert transform and the demodulation property of the extended Fourier transform. The estimates of the instantaneous frequencies and instantaneous amplitudes can be used to modify the spectral distribution and to smoothly connect two words in corpus-based speech synthesis systems.
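
The Hilbert-transform step in this abstract can be illustrated for a single, already separated component. The sketch builds the analytic signal with a naive discrete Fourier transform and reads the instantaneous amplitude and frequency from it; it covers neither the paper's decomposition into AM-FM components nor its extended Fourier transform, and the function names are assumptions.

```python
import cmath
import math

def analytic_signal(x):
    """Analytic signal via a naive DFT (O(N^2); fine for short frames)."""
    n = len(x)
    spectrum = [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
                for k in range(n)]
    for k in range(n):
        if k == 0 or 2 * k == n:
            continue               # keep DC and Nyquist bins unchanged
        if 2 * k < n:
            spectrum[k] *= 2.0     # double the positive frequencies
        else:
            spectrum[k] = 0.0      # remove the negative frequencies
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def instantaneous_amplitude(z):
    """Envelope: magnitude of the analytic signal."""
    return [abs(v) for v in z]

def instantaneous_frequency(z, fs):
    """Frequency in Hz from the phase step between consecutive samples."""
    return [cmath.phase(z[t + 1] * z[t].conjugate()) * fs / (2.0 * math.pi)
            for t in range(len(z) - 1)]

# Synthetic component: amplitude 2, 50 Hz, exactly 10 cycles in the frame.
fs = 1000
signal = [2.0 * math.cos(2.0 * math.pi * 50.0 * t / fs) for t in range(200)]
z = analytic_signal(signal)
amplitude = instantaneous_amplitude(z)
frequency = instantaneous_frequency(z, fs)
```

For a real vowel, each harmonic or formant band would first have to be isolated, for example with bandpass filters centered on multiples of the estimated fundamental frequency, before this step gives meaningful values.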
