• Title/Summary/Keyword: speech features

Efficiency of Speech Features (음성 특징의 효율성)

  • 황규웅
    • Proceedings of the Acoustical Society of Korea Conference
    • /
    • 1995.06a
    • /
    • pp.225-227
    • /
    • 1995
  • This paper compares waveform, cepstrum, and spline wavelet features using nonlinear discriminant analysis. This measure reflects the efficiency of a speech parametrization better than older linear separability criteria, and it can also be used to measure the efficiency of each layer of a given system. The spline wavelet transform yields larger gaps between classes, while cepstral features cluster more tightly than spline wavelet features. Neither feature has good properties for classification, so the Gabor wavelet transform, Mel cepstrum, delta cepstrum, and other features will be compared in future work.
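
The abstract contrasts nonlinear discriminant analysis with older linear separability criteria. For reference, here is a minimal numpy sketch of one classical linear criterion, the scatter ratio trace(Sw⁻¹Sb); the feature matrices and phone labels are hypothetical placeholders, and the paper's nonlinear measure is not reproduced here.

```python
import numpy as np

def linear_separability(features, labels):
    """Classical linear criterion trace(Sw^-1 Sb): larger values mean
    class means are far apart relative to the within-class spread."""
    classes = np.unique(labels)
    mean_all = features.mean(axis=0)
    d = features.shape[1]
    S_w = np.zeros((d, d))  # within-class scatter
    S_b = np.zeros((d, d))  # between-class scatter
    for c in classes:
        X_c = features[labels == c]
        mean_c = X_c.mean(axis=0)
        S_w += (X_c - mean_c).T @ (X_c - mean_c)
        diff = (mean_c - mean_all)[:, None]
        S_b += len(X_c) * (diff @ diff.T)
    return np.trace(np.linalg.pinv(S_w) @ S_b)

# Hypothetical usage: cepstral and wavelet are (n_frames, n_dims)
# feature matrices for the same frames, labels the phone classes.
# print(linear_separability(cepstral, labels))
# print(linear_separability(wavelet, labels))
```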

The Study of Prosodic Features in Korean Topic Constructions (한국어 화제구문의 운율적 고찰)

  • Hwang, Son-Moon
    • Speech Sciences
    • /
    • v.9 no.2
    • /
    • pp.59-68
    • /
    • 2002
  • This paper analyzes the prosodic features distinctively associated with Korean topic constructions (marked by nun or its variant un) and subject constructions (marked by ka or its variant i) as a way of explicating the role that prosody plays in differentially constituting their discourse messages. Using both spoken data elicited in controlled settings and spontaneous conversational data, an attempt is made to identify differentiating prosodic features and intonation contours associated with distinct meanings and functions of nun- and ka-constructions evoked in a variety of discourse contexts.
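
A reader wanting to reproduce this kind of contour comparison could start from a minimal F0-extraction sketch like the one below, using librosa's pYIN implementation; the file names and pitch bounds are hypothetical, and the paper's own measurement procedure is not specified in the abstract.

```python
import numpy as np
import librosa

# Hypothetical recordings of a nun-marked and a ka-marked utterance.
for path in ["topic_nun.wav", "subject_ka.wav"]:
    y, sr = librosa.load(path, sr=16000)
    f0, voiced_flag, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    f0 = f0[~np.isnan(f0)]  # keep voiced frames only
    print(path, f"mean F0 {f0.mean():.1f} Hz, range {np.ptp(f0):.1f} Hz")
```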

A Study on Emotion Recognition of Chunk-Based Time Series Speech (청크 기반 시계열 음성의 감정 인식 연구)

  • Hyun-Sam Shin;Jun-Ki Hong;Sung-Chan Hong
    • Journal of Internet Computing and Services
    • /
    • v.24 no.2
    • /
    • pp.11-18
    • /
    • 2023
  • In the field of Speech Emotion Recognition (SER), many recent studies have sought to improve accuracy through both modeling and voice features. Noting that vocal emotion is related to the flow of time, this paper separates voice files into time-series chunks at fixed intervals. After separation, we propose a model that classifies the emotion of speech data by extracting the speech features Mel spectrogram, Chroma, zero-crossing rate (ZCR), root mean square (RMS) energy, and mel-frequency cepstral coefficients (MFCC) and feeding them to recurrent neural network models suited to sequential data. In the proposed method, voice features are extracted from all files using the 'librosa' library and applied to the neural network models. The experiments compare and analyze the performance of recurrent neural network (RNN), long short-term memory (LSTM), and gated recurrent unit (GRU) models on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) English dataset.
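
A minimal sketch of the feature-extraction and chunking step described above, assuming librosa's default hop length so that the five feature types share one frame rate; the chunk size, sample rate, and file path are placeholders, and the RNN/LSTM/GRU classifiers themselves are omitted.

```python
import numpy as np
import librosa

def chunked_features(path, sr=16000, chunk_frames=100):
    """Extract the five feature types named in the abstract and split
    the frame sequence into fixed-size chunks for a recurrent model."""
    y, sr = librosa.load(path, sr=sr)
    feats = np.vstack([
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40),  # Mel
        librosa.feature.chroma_stft(y=y, sr=sr),                # Chroma
        librosa.feature.zero_crossing_rate(y=y),                # ZCR
        librosa.feature.rms(y=y),                               # RMS
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13),            # MFCC
    ]).T  # -> (n_frames, n_dims); all use the same default hop of 512
    n = len(feats) // chunk_frames
    # Each chunk becomes one input sequence: (n_chunks, chunk_frames, n_dims).
    return feats[: n * chunk_frames].reshape(n, chunk_frames, -1)
```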

Combining multi-task autoencoder with Wasserstein generative adversarial networks for improving speech recognition performance (음성인식 성능 개선을 위한 다중작업 오토인코더와 와설스타인식 생성적 적대 신경망의 결합)

  • Kao, Chao Yuan;Ko, Hanseok
    • The Journal of the Acoustical Society of Korea
    • /
    • v.38 no.6
    • /
    • pp.670-677
    • /
    • 2019
  • Because background noise in an acoustic signal degrades the performance of speech or acoustic event recognition, extracting noise-robust acoustic features from noisy signals remains challenging. In this paper, we propose a deep learning architecture that combines a Wasserstein Generative Adversarial Network (WGAN) with a Multi-Task AutoEncoder (MTAE), integrating the strengths of both so that it estimates not only the noise but also the speech features from a noisy acoustic source. The proposed MTAE-WGAN structure estimates the speech signal and the residual noise by employing a gradient penalty and a weight initialization method for the Leaky Rectified Linear Unit (LReLU) and Parametric ReLU (PReLU). With the adopted gradient-penalty loss function, the MTAE-WGAN structure enhances the speech features and achieves substantial Phoneme Error Rate (PER) improvements over stand-alone Deep Denoising Autoencoder (DDAE), MTAE, Redundant Convolutional Encoder-Decoder (R-CED), and Recurrent MTAE (RMTAE) models for robust speech recognition.
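
The gradient penalty referred to here is, in the standard WGAN-GP formulation, a penalty on the critic's gradient norm at points interpolated between real and generated samples. Below is a minimal PyTorch sketch of that term under the assumption of vector-shaped features; it does not reproduce the paper's MTAE-WGAN architecture or its LReLU/PReLU weight-initialization scheme.

```python
import torch

def gradient_penalty(critic, real, fake, lam=10.0):
    """Standard WGAN-GP term: push the critic's gradient norm toward 1
    at random interpolates between real and fake feature vectors."""
    eps = torch.rand(real.size(0), 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    score = critic(interp)
    grad, = torch.autograd.grad(score.sum(), interp, create_graph=True)
    return lam * ((grad.norm(2, dim=1) - 1.0) ** 2).mean()
```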

Emotion recognition from speech using Gammatone auditory filterbank

  • Le, Ba-Vui;Lee, Young-Koo;Lee, Sung-Young
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2011.06a
    • /
    • pp.255-258
    • /
    • 2011
  • This paper describes an application of the Gammatone auditory filterbank, a bank of Gammatone filters used as a preprocessing stage before feature extraction, to emotion recognition from speech. In the feature extraction step, the energy of each filter's output signal is computed and combined with those of all the other filters to produce a feature vector for the learning step. Each feature vector is estimated over a short time window of the input speech signal to exploit its time-domain dependence. Finally, in the learning step, a Hidden Markov Model (HMM) is trained for each emotion class and used to recognize the emotion of an input utterance. In the experiments, feature extraction based on the Gammatone filterbank (GTF) yields better results than features based on Mel-Frequency Cepstral Coefficients (MFCC), a well-known feature extraction method for both speech recognition and emotion recognition.
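
A minimal sketch of the per-filter energy features this abstract describes, using scipy.signal.gammatone (available since SciPy 1.6); the filter count, the geometric spacing of center frequencies, and the frame parameters are assumptions rather than the paper's settings, and the HMM stage is omitted.

```python
import numpy as np
from scipy import signal

def gammatone_energies(y, sr, n_filters=26, fmin=80.0,
                       frame_len=400, hop=160):
    """Pass the signal through a gammatone filterbank and return the
    per-frame log energy of each filter's output."""
    fcs = np.geomspace(fmin, sr / 2 * 0.9, n_filters)  # center freqs
    bands = []
    for fc in fcs:
        b, a = signal.gammatone(fc, 'iir', fs=sr)
        out = signal.lfilter(b, a, y)
        n = 1 + (len(out) - frame_len) // hop
        e = [np.sum(out[i * hop: i * hop + frame_len] ** 2)
             for i in range(n)]
        bands.append(np.log(np.asarray(e) + 1e-10))
    return np.stack(bands, axis=1)  # (n_frames, n_filters)
```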

Proposed Efficient Architectures and Design Choices in SoPC System for Speech Recognition

  • Trang, Hoang;Hoang, Tran Van
    • Journal of IKEEE
    • /
    • v.17 no.3
    • /
    • pp.241-247
    • /
    • 2013
  • This paper presents the design of a System on Programmable Chip (SoPC) based on a Field Programmable Gate Array (FPGA) for speech recognition, in which Mel-Frequency Cepstral Coefficients (MFCC) are used for speech feature extraction and Vector Quantization (VQ) for recognition. The recognition system proceeds through the following steps: feature extraction, codebook training, and recognition. In the feature extraction step, the input voice data are transformed into spectral components and the main features are extracted with the MFCC algorithm. In the recognition step, the spectral features obtained in the first step are processed and compared with the trained components using VQ. In our experiments, Altera's DE2 board with a Cyclone II FPGA is used to implement the recognition system, which can recognize 64 words. The execution speed of each block in the speech recognition system is surveyed by counting the clock cycles spent executing it, and recognition accuracy is measured under different system parameters. These results on execution speed and recognition accuracy can help designers choose the best configuration for speech recognition on an SoPC.
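
A minimal software sketch of the MFCC-plus-VQ pipeline that the hardware implements, using librosa for MFCC and scipy's k-means for the codebooks: one codebook is trained per word, and a test utterance is assigned to the word whose codebook quantizes its frames with the lowest mean distortion. The codebook size and file paths are placeholders.

```python
import numpy as np
import librosa
from scipy.cluster import vq

def mfcc_frames(path, sr=16000, n_mfcc=13):
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def train_codebooks(word_to_paths, k=32):
    """One k-means codebook per word, from its pooled MFCC frames."""
    return {w: vq.kmeans(np.vstack([mfcc_frames(p) for p in ps]), k)[0]
            for w, ps in word_to_paths.items()}

def recognize(path, codebooks):
    """Pick the word whose codebook gives the lowest mean distortion."""
    frames = mfcc_frames(path)
    return min(codebooks,
               key=lambda w: vq.vq(frames, codebooks[w])[1].mean())
```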

Treatment of Velopharyngeal Insufficiency in Kabuki Syndrome: Case Report (가부키 증후군 환자의 구개인두부전증의 치료: 증례보고)

  • Lee, San-Ha;Wang, Jae-Kwon;Park, Mi-Kyong;Baek, Rong-Min
    • Archives of Plastic Surgery
    • /
    • v.38 no.2
    • /
    • pp.203-206
    • /
    • 2011
  • Purpose: Kabuki syndrome is a multiple malformation syndrome first reported in Japan. It is characterized by distinctive Kabuki-like facial features, skeletal anomalies, dermatoglyphic abnormalities, short stature, and mental retardation. We report two cases of Kabuki syndrome with surgical intervention and speech evaluation. Methods: Both patients had velopharyngeal insufficiency and underwent a superiorly based pharyngeal flap operation. Preoperative and postoperative speech evaluations were performed by a speech-language pathologist. Results: In case 1, hypernasality was reduced in spontaneous speech, and the nasalance scores in syllable repetitions were reduced to within normal ranges. In case 2, hypernasality in spontaneous speech was reduced from severe to moderate, and the nasalance scores in syllable repetitions were also reduced to within normal ranges. Conclusion: The goal of this article is to raise awareness among plastic surgeons who may encounter patients with these unique facial features. This study shows that a pharyngeal flap operation can successfully correct velopharyngeal insufficiency in Kabuki syndrome, and that postoperative speech therapy plays a role in reinforcing the surgical result.

Differentiation of Aphasic Patients from the Normal Control Via a Computational Analysis of Korean Utterances

  • Kim, HyangHee;Choi, Ji-Myoung;Kim, Hansaem;Baek, Ginju;Kim, Bo Seon;Seo, Sang Kyu
    • International Journal of Contents
    • /
    • v.15 no.1
    • /
    • pp.39-51
    • /
    • 2019
  • Spontaneous speech provides rich information defining the linguistic characteristics of individuals, so computational analysis of speech can make evaluating patients' speech more efficient. This study aims to provide a method to differentiate persons with and without aphasia based on language usage. Ten aphasic patients and matched normal controls participated, and all were asked to describe a set of given words. Their utterances were linguistically processed and compared with each other. Computational analyses ranging from Principal Component Analysis (PCA) to machine learning were conducted to select the relevant linguistic features and then to classify the two groups based on the selected features. Function words, not content words, were the main differentiators of the two groups; the most viable discriminators were demonstratives, function words, sentence-final endings, and postpositions. The machine learning classification model was quite accurate (90%) and impressively stable. This study is noteworthy as the first attempt to use computational analysis to characterize word usage patterns in Korean aphasic patients and thereby discriminate them from a normal group.
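
A minimal sketch of the PCA-then-classify pipeline outlined above, using scikit-learn; the feature matrix is random placeholder data standing in for per-speaker frequencies of features such as demonstratives, function words, sentence-final endings, and postpositions, and logistic regression is an assumed stand-in for the study's actual classifier.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.random((20, 12))            # placeholder: 20 speakers x 12 features
y = np.array([1] * 10 + [0] * 10)   # 1 = aphasic, 0 = control

clf = make_pipeline(PCA(n_components=4), LogisticRegression())
print(cross_val_score(clf, X, y, cv=5).mean())
```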

Combination Tandem Architecture with Segmental Features for Robust Speech Recognition (강인한 음성 인식을 위한 탠덤 구조와 분절 특징의 결합)

  • Yun, Young-Sun;Lee, Yun-Keun
    • MALSORI
    • /
    • no.62
    • /
    • pp.113-131
    • /
    • 2007
  • Previous studies report that segmental-feature-based recognition systems show better results than conventional feature-based systems. Meanwhile, various studies have combined neural networks and hidden Markov models within a single system, in the expectation of combining the advantages of both. Influenced by this work, the tandem approach was introduced, using a neural network as the classifier and hidden Markov models as the decoder. In this paper, we apply the trend information of segmental features to the tandem architecture and use the posterior probabilities output by the neural network as inputs to the recognition system. Experiments are performed on the Aurora database to examine the potential of the trend-feature-based tandem architecture. The results show that the proposed system performs best in very low SNR environments. We therefore argue that trend information in a tandem architecture can be used in addition to traditional MFCC features.
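
A minimal sketch of the classic tandem recipe referenced above: train a neural-network phone classifier, then decorrelate its log posteriors and hand them to an HMM decoder as observation features. The MLP size and PCA dimensionality are assumptions, and the paper's segmental trend features would form the classifier's input.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

def tandem_features(train_X, train_phones, test_X, n_out=24):
    """Tandem recipe: NN phone posteriors -> log -> PCA decorrelation;
    the resulting vectors become inputs for an HMM-based recognizer."""
    mlp = MLPClassifier(hidden_layer_sizes=(256,), max_iter=300)
    mlp.fit(train_X, train_phones)
    log_post = lambda X: np.log(mlp.predict_proba(X) + 1e-10)
    pca = PCA(n_components=n_out).fit(log_post(train_X))
    return pca.transform(log_post(test_X))
```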

The Role of Prosody in Dialect Synthesis and Authentication

  • Yoon, Kyu-Chul
    • Phonetics and Speech Sciences
    • /
    • v.1 no.1
    • /
    • pp.25-31
    • /
    • 2009
  • The purpose of this paper is to examine the viability of synthesizing Masan dialect from Seoul dialect and to examine the role of prosody in the authentication of the synthesized Masan dialect. The synthesis was performed by transferring one or more of the prosodic features of a Masan utterance onto the corresponding Seoul utterance. The hypothesis is that, given an utterance composed of phonemes shared by both dialects, as more prosodic features of the Masan utterance are transferred onto the Seoul utterance, the Seoul utterance will be identified as a more authentic Masan utterance. The prosodic features involved were the fundamental frequency contour, the segmental durations, and the intensity contour. The synthesized Masan utterances were evaluated by thirteen native speakers of Masan dialect. The results showed that the fundamental frequency contour and the segmental durations had the main effects on the perceptual shift from Seoul to Masan dialect.
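
A minimal sketch of one of the three transfers described, the fundamental frequency contour, using the parselmouth interface to Praat's Manipulation and overlap-add resynthesis; the file names and pitch-range settings are placeholders, and transferring segmental durations would additionally require an aligned duration tier, which this sketch omits.

```python
import parselmouth
from parselmouth.praat import call

def transfer_f0(source_path, target_path, out_path):
    """Replace the target utterance's F0 contour with the source's and
    resynthesize; assumes utterances of roughly matching duration."""
    src = parselmouth.Sound(source_path)   # e.g. Masan utterance
    tgt = parselmouth.Sound(target_path)   # e.g. Seoul utterance
    tgt_manip = call(tgt, "To Manipulation", 0.01, 75, 600)
    src_manip = call(src, "To Manipulation", 0.01, 75, 600)
    pitch_tier = call(src_manip, "Extract pitch tier")
    call([pitch_tier, tgt_manip], "Replace pitch tier")
    out = call(tgt_manip, "Get resynthesis (overlap-add)")
    out.save(out_path, "WAV")
```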
