• Title/Summary/Keyword: Speech feature

Multimodal Parametric Fusion for Emotion Recognition

  • Kim, Jonghwa
    • International journal of advanced smart convergence / v.9 no.1 / pp.193-201 / 2020
  • The main objective of this study is to investigate the impact of additional modalities on the performance of emotion recognition using speech, facial expression, and physiological measurements. To compare different approaches, we designed a feature-based recognition system as a benchmark, which carries out linear supervised classification followed by leave-one-out cross-validation. For the classification of four emotions, bimodal fusion in our experiment improved the recognition accuracy over the unimodal approach, while the performance of trimodal fusion varied strongly from individual to individual. Furthermore, we observed extremely high disparity between single-class recognition rates, and no single modality consistently performed best in our experiment. Based on these observations, we developed a novel fusion method, called parametric decision fusion (PDF), which builds emotion-specific classifiers and exploits the advantages of a parameterized decision process. Using the PDF scheme, we achieved a 16% improvement in accuracy for subject-dependent recognition and 10% for subject-independent recognition compared to the best unimodal results.
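
As a rough illustration of decision-level fusion with emotion-specific weighting, the Python sketch below combines per-modality class posteriors using per-class weights. The weighting scheme and all values are hypothetical stand-ins, not the paper's actual PDF formulation.

```python
import numpy as np

# Hypothetical decision-level fusion: each modality produces class
# posteriors, and emotion-specific weights parameterize the decision.
# This sketches the general idea, not the paper's exact PDF scheme.

def parametric_decision_fusion(posteriors, weights):
    """posteriors, weights: arrays of shape (n_modalities, n_classes)."""
    fused = (posteriors * weights).sum(axis=0)  # weighted score per emotion
    return int(np.argmax(fused))                # index of the winning emotion

# Example: 3 modalities (speech, face, physiology) scoring 4 emotions.
p = np.array([[0.1, 0.6, 0.2, 0.1],
              [0.3, 0.3, 0.3, 0.1],
              [0.2, 0.2, 0.5, 0.1]])
w = np.ones_like(p) / 3.0                       # uniform placeholder weights
print(parametric_decision_fusion(p, w))         # -> 1
```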

Moving Average Filter for Automatic Music Segmentation & Summarization (이동 평균 필터를 적용한 음악 세그멘테이션 및 요약)

  • Kim Kil-Youn;Oh Yung-Hwan
    • Proceedings of the KSPS conference / 2006.05a / pp.143-146 / 2006
  • Music is now digitally produced and distributed via the Internet, and we encounter a huge amount of music every day. Music summarization technology has been studied to help people concentrate on the most impressive section of a song, so that one can skim a song by listening only to the climax (chorus, refrain). Recent studies try to find the climax section using various methods, such as finding diagonal line segments or kernel-based segmentation. These methods fail to capture the inherent structure of music due to its polyphonic and noisy nature. In this paper, by applying a moving average filter to the time domain of MFCC/chroma features, we achieve a remarkable improvement in capturing music structure.
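
To illustrate the kind of smoothing the abstract describes, here is a minimal Python sketch (assuming librosa for MFCC extraction) that applies a moving average filter along the time trajectories of an MFCC sequence; the window length is a hypothetical choice.

```python
import numpy as np
import librosa  # assumed available for MFCC extraction

# A minimal sketch of the smoothing step: apply a moving average
# filter along the time axis of an MFCC sequence before structure
# analysis. The window length (9 frames) is a hypothetical choice.

def smooth_mfcc(y, sr, win=9):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # (13, n_frames)
    kernel = np.ones(win) / win
    # Convolve each coefficient's time trajectory with the kernel.
    return np.apply_along_axis(
        lambda t: np.convolve(t, kernel, mode="same"), 1, mfcc)
```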

Pattern Recognition of Rotor Fault Signal Using Hidden Markov Model (은닉 마르코프 모형을 이용한 회전체 결함신호의 패턴 인식)

  • Lee, Jong-Min;Kim, Seung-Jong;Hwang, Yo-Ha;Song, Chang-Seop
    • Transactions of the Korean Society of Mechanical Engineers A / v.27 no.11 / pp.1864-1872 / 2003
  • The Hidden Markov Model (HMM) has been widely used in speech recognition; however, its use in machine condition monitoring has been very limited despite its good potential. In this paper, HMM is used to recognize rotor fault patterns. First, we set up a rotor kit under unbalance and oil-whirl conditions. Time signals of the two failure conditions were sampled and translated to auto power spectra. Using a filter bank, feature vectors were calculated from these auto power spectra. Next, a continuous HMM and a discrete HMM were trained with scaled forward/backward variables and a diagonal covariance matrix. Finally, each HMM was applied to all sampled data to prove its fault recognition ability. It was found that HMM has good recognition ability in rotor fault pattern recognition despite the small number of training data sets.
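
A hedged sketch of the continuous-HMM classification stage described above, using the hmmlearn library (an assumed dependency). Feature extraction from the auto power spectra is abstracted away, and the state count is a placeholder.

```python
from hmmlearn.hmm import GaussianHMM  # assumed dependency

# A hedged sketch of the continuous-HMM stage: one HMM per fault
# class, diagonal covariances, classification by log-likelihood.
# X_by_class maps a fault label to an array of filter-bank feature
# vectors of shape (n_frames, n_bands).

def train_models(X_by_class, n_states=3):
    models = {}
    for label, X in X_by_class.items():
        m = GaussianHMM(n_components=n_states, covariance_type="diag")
        m.fit(X)
        models[label] = m
    return models

def classify(models, X):
    # Choose the fault class whose HMM scores the sequence highest.
    return max(models, key=lambda label: models[label].score(X))
```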

A Study on Feature Extraction using Wavelet Transform for Speech Recognition (웨이블렛 변환을 이용한 음성특징 추출에 관한 연구)

  • Joung Eui-jun;Chang Sung-wook;Yang Sung-il;Kwon Y.
    • Proceedings of the Acoustical Society of Korea Conference / autumn / pp.33-36 / 2001
  • In this paper, we propose a method to extract a new feature vector using the wavelet transform, replacing the MFCC (Mel-Frequency Cepstral Coefficients) features conventionally used in speech recognition. The new feature vector is constructed using MRA (Multi-Resolution Analysis). The purpose of extracting a new feature vector via the wavelet transform is to exploit its better resolution on both the time and frequency axes. Experimental results confirm that recognition using the new wavelet-based feature vector achieves a better recognition rate than the conventional method.
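
The MRA-based feature construction can be illustrated with a short Python sketch using PyWavelets (an assumed dependency); the wavelet family, decomposition depth, and per-sub-band energy summary are hypothetical choices, not the paper's exact recipe.

```python
import numpy as np
import pywt  # PyWavelets, assumed available

# A minimal MRA-style feature sketch: decompose one windowed speech
# frame with the discrete wavelet transform and keep the energy of
# each resolution level as the feature vector.

def wavelet_features(frame, wavelet="db4", level=4):
    coeffs = pywt.wavedec(frame, wavelet, level=level)
    # One energy value per sub-band: approximation + each detail level.
    return np.array([np.sum(c ** 2) for c in coeffs])

frame = np.random.randn(256)       # stand-in for one windowed speech frame
print(wavelet_features(frame))     # 5 sub-band energies
```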

A Study of English Loanwords

  • Lee, Hae-Bong
    • Proceedings of the KSPS conference / 2000.07a / pp.365-365 / 2000
  • English segments adopted into Korean can be divided into three types: some English segments /m, n, ŋ, pʰ, tʰ, kʰ/ are adopted as the original sounds [m, n, ŋ, pʰ, tʰ, kʰ] in Korean. Other segments /b, d, g/ appear in the voiceless stop forms [p, t, k]. Generative Phonology explains the presence of the above English segments in Korean, but it cannot explain why the English segments /f, v, θ, ž, č, ǰ/ disappear during the adoption process. I present a set of universal constraints from Optimality Theory, proposed by Prince and Smolensky (1993), and I show how English segments differently adopted into Korean can be explained by universal constraints such as Faith(feature), NoAffricateStop, Faith(nasal), NoNasalStop, Faith(voice), and NoVoicedStop, and by the interaction of these constraints. I conclude that Optimality Theory provides insights that better capture the nature of the phonological phenomena of English segments in Korean.
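
The constraint interaction the abstract invokes can be illustrated with a toy Optimality Theory tableau in Python. The candidates and violation counts below are illustrative only; they show how ranking NoVoicedStop above Faith(voice) selects the devoiced adaptation, matching the [p, t, k] pattern mentioned above.

```python
# A toy Optimality Theory evaluation: candidates are compared on their
# violation profiles under a ranked constraint list; lexicographic
# comparison picks the winner. Violation counts here are illustrative.

def ot_winner(candidates, ranking):
    """candidates: {form: {constraint: violations}}; ranking: ordered list."""
    def profile(form):
        return tuple(candidates[form].get(c, 0) for c in ranking)
    return min(candidates, key=profile)

# English /b/ adapted into Korean: ranking NoVoicedStop above
# Faith(voice) selects the devoiced form [p], as in the abstract.
candidates = {
    "[b]": {"NoVoicedStop": 1, "Faith(voice)": 0},
    "[p]": {"NoVoicedStop": 0, "Faith(voice)": 1},
}
print(ot_winner(candidates, ["NoVoicedStop", "Faith(voice)"]))  # -> [p]
```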

The Position of [lateral] in Feature Geometry

  • Jun Jongho
    • MALSORI / no.29_30 / pp.95-104 / 1995
  • In recent phonology there have been two approaches to where the feature [lateral] is located in feature geometry. Levin (1988), based on the restriction that laterals appear only at the coronal place, argues that [lateral] is a dependent of the Coronal node. In contrast, Rice & Avery (1991) and Shaw (1991) argue that [lateral] is located high in the feature tree. To compare the two theories, this paper considers the following phonological and phonetic factors. First, the phonetic function of laterals suggests that [lateral] is a manner feature, which is generally taken to sit high in the tree. Second, the existence of velar laterals reported in the Papuan language family invalidates the restriction that laterals occur only at the coronal place, casting doubt on the premise of Levin's theory. Third, a discussion of several types of assimilation shows that assimilation is better explained under the theory in which [lateral] is located high in the tree. Finally, the transparency of laterals in the coronal harmony of Chumash and Tahltan, and the OCP effects found in Cambodian and Javanese, provide evidence for underspecified laterals that cannot be explained under the theory in which [lateral] is a dependent of the place node. Based on these arguments, this paper concludes that the claim that [lateral] is located high in the feature tree is correct.

Feature Compensation with Model-based Estimation for Noise Masking (잡음마스킹을 이용한 환경보상기법)

  • Kim, Young-Joon;Kim, Nam-Soo;Lee, Yun-Gun
    • Proceedings of the KSPS conference / 2006.11a / pp.7-10 / 2006
  • In this paper, we present a method for measuring the degree to which noise is masked, on a probabilistic basis, using a speech model. We describe how to compute a "noise masking probability" as the criterion for measuring the degree of noise masking, and examine its characteristics. We then apply the noise masking probability to improving the performance of speech recognition feature vectors in noisy environments. The proposed method was evaluated on the Aurora2 database, which ETSI provides as a standard speech recognition benchmark. As a result, we achieved a 16.58% performance improvement over the existing algorithm.
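
As a hedged illustration of a "noise masking probability", the sketch below computes, under an assumed Gaussian model of a channel's clean speech log-power, the probability that speech exceeds the observed noise level. The paper's exact definition may differ, and all parameter values are placeholders.

```python
from scipy.stats import norm

# A hedged sketch of a 'noise masking probability': under a Gaussian
# model of a channel's clean speech log-power (mu, sigma), compute the
# probability that speech exceeds the observed noise log-power, i.e.
# that the noise is masked. The paper's exact definition may differ.

def noise_masking_prob(mu, sigma, noise_log_power):
    return 1.0 - norm.cdf(noise_log_power, loc=mu, scale=sigma)

print(noise_masking_prob(mu=2.0, sigma=0.5, noise_log_power=1.5))  # ~0.84
```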

Study on the Recognition of Spoken Korean Continuous Digits Using Phone Network (음성망을 이용한 한국어 연속 숫자음 인식에 관한 연구)

  • Lee, G.S.;Lee, H.J.;Byun, Y.G.;Kim, S.H.
    • Proceedings of the KIEE Conference / 1988.07a / pp.624-627 / 1988
  • This paper describes the implementation of speaker-dependent recognition of Korean spoken continuous digits. The recognition system can be divided into two parts: an acoustic-phonetic processor and a lexical decoder. The acoustic-phonetic processor calculates feature vectors from the input speech signal and then performs frame labelling and phone labelling. Frame labelling is performed by a Bayesian classification method, and phone labelling is performed using the labelled frames and a posteriori probabilities. The lexical decoder accepts segments (phones) from the acoustic-phonetic processor and decodes their lexical structure through a phone network constructed from the phonetic representations of the ten digits. The experiment was carried out with two sets of 4-continuous-digit utterances, each set composed of 35 patterns. An evaluation of the system yielded a pattern accuracy of about 80 percent, resulting from a word accuracy of about 95 percent.
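
The Bayesian frame-labelling step can be sketched as follows in Python (SciPy assumed); per-class Gaussians and all variable names stand in for the paper's actual acoustic models.

```python
import numpy as np
from scipy.stats import multivariate_normal

# A minimal sketch of Bayesian frame labelling: each frame is assigned
# the phone class maximizing log p(x | class) + log p(class), with
# per-class Gaussians standing in for the paper's actual models.

def label_frames(X, means, covs, priors):
    """X: (n_frames, dim); means, covs, priors: per-class parameters."""
    scores = np.stack([
        multivariate_normal.logpdf(X, mean=means[k], cov=covs[k])
        + np.log(priors[k])
        for k in range(len(priors))])      # (n_classes, n_frames)
    return scores.argmax(axis=0)           # most probable phone per frame
```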

Classification of Pornographic Videos Based on Audio Information (오디오 신호에 기반한 음란 동영상 판별)

  • Kim, Bong-Wan;Choi, Dae-Lim;Lee, Yong-Ju
    • MALSORI / no.63 / pp.139-151 / 2007
  • As the Internet becomes prevalent in our lives, harmful content, such as pornographic videos, has been increasing on the Internet, which has become a very serious problem. To prevent this, there are many filtering systems, mainly based on keyword- or image-based methods. The main purpose of this paper is to devise a system that classifies pornographic videos based on audio information. As the feature vector we use the mel-cepstrum modulation energy (MCME), a modulation energy calculated on the time trajectory of the mel-frequency cepstral coefficients (MFCC), as well as the MFCC itself. For the classifier, we use the well-known Gaussian mixture model (GMM). The experimental results show that the proposed system correctly classified 98.3% of the pornographic data and 99.8% of the non-pornographic data. We expect the proposed method can be applied to a more accurate classification system that uses both video and audio information.
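
A minimal sketch of an MCME-style feature in Python (assuming librosa): take the modulation spectrum of each MFCC coefficient's time trajectory and sum the energy in a low modulation-frequency band. The band edges and all defaults are hypothetical choices, not the paper's parameters.

```python
import numpy as np
import librosa  # assumed available for MFCC extraction

# A minimal sketch of an MCME-style feature: take the modulation
# spectrum of each MFCC coefficient's time trajectory and sum the
# energy in a modulation band. Band edges (2-16 Hz) are hypothetical.

def mcme(y, sr, lo=2.0, hi=16.0):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # (13, T)
    spec = np.abs(np.fft.rfft(mfcc, axis=1)) ** 2         # modulation spectrum
    frame_rate = sr / 512                                 # librosa's default hop
    freqs = np.fft.rfftfreq(mfcc.shape[1], d=1.0 / frame_rate)
    band = (freqs >= lo) & (freqs <= hi)
    return spec[:, band].sum(axis=1)                      # one energy per coeff
```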

Dysarthric speaker identification with different degrees of dysarthria severity using deep belief networks

  • Farhadipour, Aref;Veisi, Hadi;Asgari, Mohammad;Keyvanrad, Mohammad Ali
    • ETRI Journal / v.40 no.5 / pp.643-652 / 2018
  • Dysarthria is a degenerative disorder of the central nervous system that affects the control of articulation and pitch; therefore, it affects the uniqueness of the sound produced by the speaker, making dysarthric speaker recognition a challenging task. In this paper, a feature-extraction method based on deep belief networks is presented for the task of identifying a speaker suffering from dysarthria. The effectiveness of the proposed method is demonstrated and compared with well-known mel-frequency cepstral coefficient features. For classification purposes, a multi-layer perceptron neural network with two structures is proposed. Our evaluations using the Universal Access speech database produced promising results and outperformed other baseline methods. In addition, speaker identification under both text-dependent and text-independent conditions is explored. The highest accuracy achieved with the proposed system is 97.3%.
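
Only the classification stage lends itself to a short sketch here: a multi-layer perceptron over fixed-length utterance-level features (scikit-learn assumed), with random data standing in for the DBN activations, which are not reproduced.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# A hedged sketch of the classification stage only: a multi-layer
# perceptron over fixed-length utterance-level features (e.g. DBN
# activations). Random data stands in for the real features here.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 40))             # 100 utterances, 40-dim features
y = rng.integers(0, 5, size=100)           # 5 speaker labels
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300)
clf.fit(X, y)
print(clf.score(X, y))                     # training accuracy on toy data
```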