• Title/Summary/Keyword: speech emotion recognition

Speech emotion recognition through time series classification (시계열 데이터 분류를 통한 음성 감정 인식)

  • Kim, Gi-duk;Kim, Mi-sook;Lee, Hack-man
    • Proceedings of the Korean Society of Computer Information Conference / 2021.07a / pp.11-13 / 2021
  • This paper proposes speech emotion recognition through time-series classification. Features are extracted from speech files as mel-spectrograms and converted into multivariate time-series data, which is then used to train a deep learning model combining Conv1D, GRU, and Transformer components. Applied to the speech emotion recognition datasets TESS, SAVEE, RAVDESS, and EmoDB, the model achieved higher classification accuracy than existing models on each dataset: 99.60%, 99.32%, 97.28%, and 99.86%, respectively. (An illustrative sketch of this pipeline appears below.)

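The pipeline summarized in this abstract could look roughly like the following; this is a minimal sketch under stated assumptions, not the authors' implementation. The number of mel bands, layer sizes, and the seven-class output are illustrative choices.

```python
# Minimal sketch: mel-spectrogram frames treated as a multivariate time series
# and fed to a Conv1D + GRU model with a Transformer-style attention block.
import librosa
import numpy as np
import tensorflow as tf

N_MELS = 64          # number of mel bands -> dimensionality of the time series
N_CLASSES = 7        # e.g. EmoDB distinguishes 7 emotion labels

def wav_to_series(path, sr=16000):
    """Load a speech file and return a (frames, N_MELS) multivariate time series."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=N_MELS)
    return librosa.power_to_db(mel).T

def build_model(time_steps):
    inputs = tf.keras.Input(shape=(time_steps, N_MELS))
    x = tf.keras.layers.Conv1D(128, 5, padding="same", activation="relu")(inputs)
    x = tf.keras.layers.GRU(128, return_sequences=True)(x)
    # Transformer-style block: multi-head self-attention with a residual connection
    attn = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=32)(x, x)
    x = tf.keras.layers.LayerNormalization()(x + attn)
    x = tf.keras.layers.GlobalAveragePooling1D()(x)
    outputs = tf.keras.layers.Dense(N_CLASSES, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model(time_steps=300)   # e.g. pad/crop every series to 300 frames
```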

Analyzing the element of emotion recognition from speech (음성으로부터 감성인식 요소분석)

  • Sim, Kwee-Bo;Park, Chang-Hyun
    • Journal of the Korean Institute of Intelligent Systems / v.11 no.6 / pp.510-515 / 2001
  • Generally, the elements available for recognizing emotion from a speech signal include (1) the words of the conversation, (2) tone, (3) pitch, (4) formant frequencies, and (5) speech speed. For human listeners, tone, voice quality, and speaking rate are more natural cues than frequency for perceiving another person's feelings, so these elements are important for classifying emotions and have been the main basis of previous methods; formants, however, are well suited to machine implementation. The final goal of this research is therefore to implement an emotion recognition system based on pitch, formants, speech speed, and related cues extracted from the speech signal. In this paper, as a first stage, we identify the specific features of anger that appear in a speaker's utterances when he becomes angry. (A sketch of extracting some of these cues appears below.)

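The prosodic cues discussed in this abstract (pitch, energy, speaking rate) can be extracted along the following lines with librosa; formant estimation usually requires a tool such as Praat and is omitted here. The particular summary statistics are assumptions for illustration, not the paper's feature set.

```python
# Hedged sketch: pitch, energy, and a crude speaking-rate proxy from one utterance.
import librosa
import numpy as np

def prosodic_features(path, sr=16000):
    y, _ = librosa.load(path, sr=sr)
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
    rms = librosa.feature.rms(y=y)[0]                 # frame-level energy
    onsets = librosa.onset.onset_detect(y=y, sr=sr)   # rough syllable-like events
    duration = len(y) / sr
    return {
        "pitch_mean_hz": float(np.nanmean(f0)),       # NaNs mark unvoiced frames
        "pitch_range_hz": float(np.nanmax(f0) - np.nanmin(f0)),
        "energy_mean": float(rms.mean()),
        "speech_rate_proxy": len(onsets) / duration,  # onsets per second
    }
```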

Emotion Recognition of Low Resource (Sindhi) Language Using Machine Learning

  • Ahmed, Tanveer;Memon, Sajjad Ali;Hussain, Saqib;Tanwani, Amer;Sadat, Ahmed
    • International Journal of Computer Science & Network Security / v.21 no.8 / pp.369-376 / 2021
  • One of the most active areas of research in affective computing and signal processing is emotion recognition. This paper proposes emotion recognition for the low-resource Sindhi language. The uniqueness of this work is that it examines the emotions of a language for which there is currently no publicly accessible dataset. The proposed effort provides a dataset named MAVDESS (Mehran Audio-Visual Database of Emotional Speech in Sindhi) for the academic community of the Sindhi language, which is mainly spoken in Pakistan but for which little generic machine learning data is available. Furthermore, the emotions in MAVDESS were analyzed and annotated using features such as pitch, volume, and bass, with toolkits such as OpenSmile and Scikit-Learn and classification schemes such as LR, SVC, DT, and KNN trained in Python. The dataset can be accessed in future via https://doi.org/10.5281/zenodo.5213073. (A sketch of the classifier comparison appears below.)
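
The classifier comparison described above (LR, SVC, DT, KNN) might be set up as follows with scikit-learn; this is a sketch, not the authors' code. `X` and `y` stand in for pre-extracted acoustic features (for example from OpenSmile) and emotion labels, and the random placeholder data is only there to make the snippet runnable.

```python
# Compare the four classifier families named in the abstract with 5-fold CV.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))       # placeholder feature matrix (e.g. OpenSmile output)
y = rng.integers(0, 5, size=200)     # placeholder emotion labels

classifiers = {
    "LR": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "DT": DecisionTreeClassifier(),
    "KNN": KNeighborsClassifier(),
}
for name, clf in classifiers.items():
    pipe = make_pipeline(StandardScaler(), clf)   # scale features before classifying
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```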

Text-driven Speech Animation with Emotion Control

  • Chae, Wonseok;Kim, Yejin
    • KSII Transactions on Internet and Information Systems (TIIS) / v.14 no.8 / pp.3473-3487 / 2020
  • In this paper, we present a new approach to creating speech animation with emotional expressions using a small set of example models. To generate realistic facial animation, two types of example models, called key visemes and key expressions, are used for lip synchronization and facial expressions, respectively. The key visemes represent the lip shapes of phonemes such as vowels and consonants, while the key expressions represent the basic emotions of a face. Our approach utilizes a text-to-speech (TTS) system to create a phonetic transcript for the speech animation. Based on the phonetic transcript, a sequence of speech animation is synthesized by interpolating the corresponding sequence of key visemes. Using an input parameter vector, the key expressions are blended by a method of scattered data interpolation. During synthesis, an importance-based scheme combines lip synchronization and facial expressions into one animation sequence in real time (over 120 Hz). The proposed approach can be applied to diverse types of digital content and applications that use facial animation, with high accuracy (over 90%) in speech recognition. (An illustrative blending sketch appears below.)
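
Blending key expressions by scattered data interpolation, as the abstract describes, could be sketched with an RBF interpolator as below. The two-dimensional emotion parameter space, the placement of the key expressions in it, and the identity blend-weight matrix are all hypothetical choices for illustration, not the paper's actual parameterization.

```python
# Map an input emotion parameter vector to blend weights over key expressions
# via scattered data (RBF) interpolation.
import numpy as np
from scipy.interpolate import RBFInterpolator

# Each key expression sits at a point in an assumed 2-D (valence, arousal) space.
emotion_params = np.array([
    [0.0, 0.0],    # neutral
    [0.9, 0.7],    # happy
    [-0.8, -0.4],  # sad
    [-0.7, 0.8],   # angry
])
blend_weights = np.eye(4)            # each key expression uses itself fully

blender = RBFInterpolator(emotion_params, blend_weights)

query = np.array([[0.4, 0.3]])       # input emotion parameter vector
weights = np.clip(blender(query)[0], 0.0, None)
weights /= weights.sum()             # normalized weights for the key expressions
print(dict(zip(["neutral", "happy", "sad", "angry"], weights.round(3))))
```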

The Role of Cognitive Control in Tinnitus and Its Relation to Speech-in-Noise Performance

  • Tai, Yihsin;Husain, Fatima T.
    • Korean Journal of Audiology / v.23 no.1 / pp.1-7 / 2019
  • Self-reported difficulties in speech-in-noise (SiN) recognition are common among tinnitus patients. Whereas the hearing impairment that usually co-occurs with tinnitus can explain such difficulties, recent studies suggest that tinnitus patients with normal hearing sensitivity still show decreased SiN understanding, indicating that SiN difficulties cannot be attributed solely to changes in hearing sensitivity. In fact, cognitive control, which refers to a variety of top-down processes that human beings use to complete their daily tasks, has been shown to be critical for SiN recognition, as well as key to understanding the cognitive inefficiencies caused by tinnitus. In this article, we review studies investigating the association between tinnitus and cognitive control using behavioral and brain-imaging assessments, as well as those examining the effect of tinnitus on SiN recognition. In addition, three factors that can affect cognitive control in tinnitus patients, namely hearing sensitivity, age, and severity of tinnitus, are discussed to elucidate the association among tinnitus, cognitive control, and SiN recognition. Although a central or cognitive involvement has long been postulated to explain the observed SiN impairments in tinnitus patients, there is as yet no direct evidence to underpin this assumption, as few studies have addressed both SiN performance and cognitive control in one tinnitus cohort. Future studies should aim at incorporating SiN tests with various subjective and objective methods that evaluate cognitive performance to better understand the relationship between SiN difficulties and cognitive control in tinnitus patients.

Design of Emotion Recognition system utilizing fusion of Speech and Context based emotion recognition in Smartphone (스마트폰에서 음성과 컨텍스트 기반 감정인식 융합을 활용한 감정인식 시스템 설계)

  • Cho, Seong Jin;Lee, Seongho;Lee, Sungyoung
    • Proceedings of the Korean Society of Computer Information Conference / 2012.07a / pp.323-324 / 2012
  • Unlike the past, when the many services offered in the smartphone environment were delivered uniformly and one-way to consumers, recent services increasingly attempt to serve users more effectively by personalizing the experience for each user. Among these efforts, emotion recognition research aims to provide personalized services optimized for each user by recognizing the user's emotion and offering services that match it, thereby increasing immersion and ultimately encouraging use of a particular service. This paper proposes a system design that extracts the user's emotion from user preferences and context information, fuses it with speech-based emotion recognition to improve accuracy, and can be applied in real services. The proposed system compensates for errors in emotions inferred from user preferences and context awareness by using speech-based emotion recognition, and it minimizes the cost to the user of operating the emotion recognition system. The proposed system can be used for smartphone applications and recommendation services that make use of user emotion. (An illustrative fusion sketch appears below.)

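Since this is a design paper, no specific fusion algorithm is given; the snippet below shows one simple, assumed way to fuse a speech-based emotion estimate with a context/preference-based estimate as weighted probability distributions. The emotion labels and the weighting are illustrative only.

```python
# Weighted late fusion of two emotion probability distributions.
import numpy as np

EMOTIONS = ["neutral", "happy", "sad", "angry"]

def fuse(speech_probs, context_probs, speech_weight=0.6):
    """Return the fused emotion label and distribution."""
    speech_probs = np.asarray(speech_probs, dtype=float)
    context_probs = np.asarray(context_probs, dtype=float)
    fused = speech_weight * speech_probs + (1.0 - speech_weight) * context_probs
    fused /= fused.sum()
    return EMOTIONS[int(np.argmax(fused))], fused

label, fused = fuse([0.1, 0.6, 0.2, 0.1],   # from a speech emotion recognizer
                    [0.3, 0.2, 0.4, 0.1])   # from user preference / context
print(label, fused.round(2))
```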

The Relationship Between Voice and the Image Triggered by the Voice: American Speakers and American Listeners (목소리를 듣고 감지하는 인상에 대한 연구: 미국인화자와 미국인청자)

  • Moon, Seung-Jae
    • Phonetics and Speech Sciences / v.1 no.2 / pp.111-118 / 2009
  • The present study aims at investigating the relationship between voices and the physical images triggered by those voices. It is the final part of a four-part series, and the results reported here are limited to American speakers and American listeners. Combined with the results from previous studies (Moon, 2000; Moon, 2002; Tak, 2005), the results suggest that (1) there is a very strong relationship, well above chance level, between voices and the pictures chosen for the voices by the perception-experiment subjects; (2) the more physical characteristics that are given, the better the chance of correctly matching voices with pictures; and (3) culture (in the present study, the language environment) seems to play a role in conjuring up mental images from voices.


Utilizing Korean Ending Boundary Tones for Accurately Recognizing Emotions in Utterances (발화 내 감정의 정밀한 인식을 위한 한국어 문미억양의 활용)

  • Jang In-Chang;Lee Tae-Seung;Park Mikyoung;Kim Tae-Soo;Jang Dong-Sik
    • The Journal of Korean Institute of Communications and Information Sciences / v.30 no.6C / pp.505-511 / 2005
  • Autonomous machines interacting with humans should be able to perceive states of emotion and attitude through implicit messages in order to obtain voluntary cooperation from their clients. Voice is the easiest and most natural way to exchange human messages. Automatic systems capable of understanding states of emotion and attitude have utilized features based on the pitch and energy of uttered sentences. The performance of existing emotion recognition systems can be further improved with the support of the linguistic knowledge that a specific tonal section in a sentence is related to the states of emotion and attitude. In this paper, we attempt to improve the emotion recognition rate by incorporating such linguistic knowledge about Korean ending boundary tones into an automatic system implemented with pitch-related features and multilayer perceptrons. In an experiment on a Korean emotional speech database, an improvement of 4% is confirmed. (An illustrative classifier sketch appears below.)
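
A classifier of the kind the abstract describes, pitch-related features augmented with features from the sentence-final (ending boundary tone) region and fed to a multilayer perceptron, might be sketched as follows. The feature dimensions, labels, and placeholder data are assumptions made to keep the snippet self-contained.

```python
# Pitch features + sentence-final boundary-tone features -> MLP emotion classifier.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_utts = 300
pitch_feats = rng.normal(size=(n_utts, 8))      # e.g. F0 mean/range/slope statistics
boundary_feats = rng.normal(size=(n_utts, 4))   # F0 statistics over the final syllable(s)
X = np.hstack([pitch_feats, boundary_feats])
y = rng.integers(0, 4, size=n_utts)             # placeholder emotion labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0)
mlp.fit(X_tr, y_tr)
print("held-out accuracy:", mlp.score(X_te, y_te))
```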

Analyzing the Acoustic Elements and Emotion Recognition from Speech Signal Based on DRNN (음향적 요소분석과 DRNN을 이용한 음성신호의 감성 인식)

  • Sim, Kwee-Bo;Park, Chang-Hyun;Joo, Young-Hoon
    • Journal of the Korean Institute of Intelligent Systems / v.13 no.1 / pp.45-50 / 2003
  • Recently, robot technology has developed remarkably, and emotion recognition is necessary for making robots that people can relate to closely. This paper presents a simulator, and simulation results, for recognizing and classifying emotions by learning pitch patterns. Because pitch alone is not sufficient for recognizing emotion, we add further acoustic elements, and for that reason we analyze the relation between emotion and these acoustic elements. The simulator is composed of a DRNN (Dynamic Recurrent Neural Network) and a feature-extraction module; the DRNN is the learning algorithm for pitch patterns. (An illustrative sketch appears below.)
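
The paper's DRNN is a specific dynamic recurrent architecture that is not reproduced here; as a stand-in, this sketch trains a plain recurrent network on fixed-length pitch contours to illustrate learning pitch patterns for emotion classification. The sequence length, layer sizes, and four emotion classes are assumptions.

```python
# Plain recurrent network over pitch contours (a stand-in for the paper's DRNN).
import numpy as np
import tensorflow as tf

SEQ_LEN, N_CLASSES = 100, 4
rng = np.random.default_rng(0)
pitch_contours = rng.normal(size=(500, SEQ_LEN, 1))   # placeholder F0 sequences
labels = rng.integers(0, N_CLASSES, size=500)

inputs = tf.keras.Input(shape=(SEQ_LEN, 1))
x = tf.keras.layers.SimpleRNN(32)(inputs)             # learns temporal pitch patterns
outputs = tf.keras.layers.Dense(N_CLASSES, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(pitch_contours, labels, epochs=3, verbose=0)
```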

Multi-Emotion Recognition Model with Text and Speech Ensemble (텍스트와 음성의 앙상블을 통한 다중 감정인식 모델)

  • Yi, Moung Ho;Lim, Myoung Jin;Shin, Ju Hyun
    • Smart Media Journal / v.11 no.8 / pp.65-72 / 2022
  • Due to COVID-19, counseling has moved from face-to-face to non-face-to-face settings, and the importance of non-face-to-face counseling is increasing. Its advantages are that it can be carried out online anytime, anywhere, and that it is safe from COVID-19. However, it is difficult to understand the client's state of mind because non-verbal expressions are hard to convey. It is therefore important to recognize emotions by accurately analyzing both text and voice during non-face-to-face counseling. In this paper, text data is vectorized with FastText after separating consonants, and voice data is vectorized by extracting features with the log mel spectrogram and MFCC, respectively. We propose a multi-emotion recognition model that recognizes five emotions from the vectorized data with an LSTM model, and multi-emotion recognition is evaluated using RMSE. In the experiment, the RMSE of the proposed model was 0.2174, the lowest error compared with models using text or voice data alone. (An illustrative model sketch appears below.)
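
The text-plus-speech ensemble described above might be structured as two LSTM branches, one over FastText token vectors and one over log-mel/MFCC frames, merged to predict intensities for five emotions and evaluated with RMSE. This is a sketch under assumed input shapes, not the authors' model.

```python
# Two-branch text + speech model with LSTM encoders and an RMSE-evaluated output.
import tensorflow as tf

TEXT_LEN, TEXT_DIM = 50, 300      # e.g. FastText vectors per token
AUDIO_LEN, AUDIO_DIM = 200, 60    # e.g. log-mel + MFCC frames
N_EMOTIONS = 5

text_in = tf.keras.Input(shape=(TEXT_LEN, TEXT_DIM), name="text")
audio_in = tf.keras.Input(shape=(AUDIO_LEN, AUDIO_DIM), name="audio")

text_feat = tf.keras.layers.LSTM(64)(text_in)     # encodes the token sequence
audio_feat = tf.keras.layers.LSTM(64)(audio_in)   # encodes the acoustic frames

merged = tf.keras.layers.Concatenate()([text_feat, audio_feat])
out = tf.keras.layers.Dense(N_EMOTIONS, activation="sigmoid")(merged)

model = tf.keras.Model([text_in, audio_in], out)
model.compile(optimizer="adam", loss="mse",
              metrics=[tf.keras.metrics.RootMeanSquaredError()])
model.summary()
```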