Title/Summary/Keyword: Voice signal feature

Emotion Recognition Based on Frequency Analysis of Speech Signal

  • Sim, Kwee-Bo;Park, Chang-Hyun;Lee, Dong-Wook;Joo, Young-Hoon
    • International Journal of Fuzzy Logic and Intelligent Systems
    • /
    • v.2 no.2
    • /
    • pp.122-126
    • /
    • 2002
  • In this study, we identify features of three emotions (happiness, anger, surprise) as fundamental research toward emotion recognition. A speech signal carrying emotion has several elements: voice quality, pitch, formants, speech rate, and so on. Until now, most researchers have used the change of pitch, the short-time average power envelope, or mel-based speech power coefficients. Pitch is indeed an efficient and informative feature, so we use it in this study. Because pitch is very sensitive to subtle emotion, it changes readily whenever a speaker is in a different emotional state; we can therefore observe whether the pitch changes steeply, changes with a gentle slope, or does not change. This paper also extracts formant features from emotional speech. For each vowel, the formants occupy similar positions without large differences. Based on this fact, in the happiness case we extract features of laughter and use them to separate laughing segments, which simplifies the task; we likewise find such features for anger and surprise.
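
A minimal sketch of frame-wise pitch extraction by autocorrelation, one common way to obtain the pitch contour this abstract relies on; the frame length, hop size, and test tone are illustrative assumptions, not details from the paper.

```python
import numpy as np

def frame_pitch(frame, sr, fmin=75.0, fmax=400.0):
    """Estimate F0 of one frame from its strongest autocorrelation peak."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # plausible pitch-lag range
    if hi >= len(ac) or ac[0] <= 0:
        return 0.0                            # silent or too-short frame
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

def pitch_contour(signal, sr, frame_len=0.03, hop=0.01):
    n, h = int(sr * frame_len), int(sr * hop)
    return np.array([frame_pitch(signal[i:i + n], sr)
                     for i in range(0, len(signal) - n, h)])

# A 150 Hz test tone yields a flat contour near 150 Hz; emotional speech
# would instead show the steep or gentle pitch slopes discussed above.
sr = 16000
t = np.arange(sr) / sr
print(pitch_contour(np.sin(2 * np.pi * 150 * t), sr)[:5])
```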

Design of a variable rate speech codec for the W-CDMA system (W-CDMA 시스템을 위한 가변율 음성코덱 설계)

  • 정우성
    • Proceedings of the Acoustical Society of Korea Conference
    • /
    • 1998.08a
    • /
    • pp.142-147
    • /
    • 1998
  • Recently, the 8 kb/s CS-ACELP coder of G.729 was standardized by ITU-T SG15, and it has been reported that the speech quality of G.729 is better than or equal to that of 32 kb/s ADPCM. However, G.729 is a fixed-rate speech coder and does not exploit the voice-activity pattern of mutual conversation. If we use voice activity, we can cut the average bit rate in half without any degradation of speech quality. In this paper, we propose an efficient variable-rate algorithm for G.729. The algorithm consists of two main parts: the rate-determination algorithm and the sub-rate coders. For rate determination, we combine an energy-thresholding method, a phonetic segmentation method that integrates various feature parameters obtained through the analysis procedure, and a variable hangover-period method. Through analysis of noise features, a 1 kb/s sub-rate coder is designed for coding the background-noise signal, and a 4 kb/s sub-rate coder is designed for the unvoiced parts. The performance of the variable-rate algorithm is evaluated by comparing speech quality and average bit rate with G.729, and subjective quality is assessed with a MOS test. In conclusion, it is verified that the proposed variable-rate CS-ACELP coder produces the same speech quality as G.729 at an average bit rate of 4.4 kb/s.
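
A minimal sketch of the energy-thresholding side of rate determination with a hangover period, under illustrative thresholds and toy frame energies; the paper additionally combines phonetic-segmentation features, which are omitted here.

```python
import numpy as np

def select_rates(frames_db, active_thresh=-40.0, hangover=5):
    """Pick a sub-rate per frame: 8 kb/s for active speech, 4 kb/s for
    trailing/unvoiced-like frames, 1 kb/s for background noise."""
    rates, hang = [], 0
    for e in frames_db:
        if e > active_thresh:
            rates.append(8.0)      # full-rate CS-ACELP for active speech
            hang = hangover        # reset the variable hangover counter
        elif hang > 0:
            rates.append(4.0)      # sub-rate coder bridges trailing speech
            hang -= 1
        else:
            rates.append(1.0)      # noise coder for silent background
    return np.array(rates)

# Toy frame energies (dB): a speech burst followed by silence.
energies = np.array([-30, -25, -28, -60, -62, -61, -65, -70, -70, -70])
rates = select_rates(energies)
print(rates, "average bit rate: %.1f kb/s" % rates.mean())
```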

Implementation of a Robust Speech Recognizer in Noisy Car Environment Using a DSP (DSP를 이용한 자동차 소음에 강인한 음성인식기 구현)

  • Chung, Ik-Joo
    • Speech Sciences
    • /
    • v.15 no.2
    • /
    • pp.67-77
    • /
    • 2008
  • In this paper, we implemented a robust speech recognizer on a TMS320VC33 DSP. For this implementation, we built a speech and noise database suited to a recognizer that uses the spectral subtraction method for noise removal. The recognizer has an explicit structure in the sense that the speech signal is enhanced by spectral subtraction before endpoint detection and feature extraction; this keeps the recognizer's operation clear and yields HMM models with minimal model mismatch. Since the recognizer was developed for controlling car facilities and for voice dialing, it has two recognition engines: a speaker-independent one for controlling car facilities and a speaker-dependent one for voice dialing. We adopted a continuous HMM for the former and a conventional DTW algorithm for the latter. Through various off-line recognition tests, we selected optimal values of several recognition parameters for a resource-limited embedded recognizer, which led to HMM models with three mixtures per state. The car-noise-added speech database is also enhanced by spectral subtraction before HMM parameter estimation, reducing the model mismatch caused by the nonlinear distortion that spectral subtraction introduces. The hardware module includes a microcontroller for the host interface, which handles the protocol between the DSP and the host.
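
A minimal sketch of magnitude spectral subtraction as described above: a noise spectrum is estimated from leading noise-only frames and subtracted before endpoint detection and feature extraction. The FFT size, hop, and spectral floor are assumptions, not the paper's settings.

```python
import numpy as np

def spectral_subtract(signal, noise_frames=10, n_fft=512, hop=256, floor=0.02):
    win = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * win
              for i in range(0, len(signal) - n_fft, hop)]
    spectra = np.array([np.fft.rfft(f) for f in frames])
    mag, phase = np.abs(spectra), np.angle(spectra)
    noise_mag = mag[:noise_frames].mean(axis=0)   # noise estimate from lead-in
    clean = np.maximum(mag - noise_mag, floor * noise_mag)  # spectral floor
    # Overlap-add resynthesis using the noisy phase.
    out = np.zeros(len(frames) * hop + n_fft)
    for k, m in enumerate(clean):
        out[k * hop:k * hop + n_fft] += np.fft.irfft(m * np.exp(1j * phase[k]), n_fft)
    return out

# Usage: enhance car-noise speech before endpoint detection / HMM training.
noisy = np.random.randn(8000)        # stand-in for 1 s of noisy car speech
print(spectral_subtract(noisy).shape)
```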

Investigation of Timbre-related Music Feature Learning using Separated Vocal Signals (분리된 보컬을 활용한 음색기반 음악 특성 탐색 연구)

  • Lee, Seungjin
    • Journal of Broadcast Engineering
    • /
    • v.24 no.6
    • /
    • pp.1024-1034
    • /
    • 2019
  • Preference for music is determined by a variety of factors, and identifying characteristics that reflect specific factors is important for music recommendation. In this paper, we propose a method to extract singing-voice-related music features reflecting various musical characteristics by using a model trained for singer identification. The model can be trained on music sources containing background accompaniment, but this may degrade singer-identification performance. To mitigate this problem, we first separate the background accompaniment and create a dataset of separated vocals using a model structure proven in SiSEC (the Signal Separation and Evaluation Campaign). Finally, we use the separated vocals to discover singing-voice-related music features that reflect the singer's voice, and we compare the effect of source separation against existing methods that use the music source without separation.
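
A minimal sketch of the pipeline above: separate vocals first, then compute a timbre-oriented input for a singer-identification network. `separate_vocals` is a hypothetical stand-in for a SiSEC-style trained separator, and the log filter-bank settings are illustrative assumptions.

```python
import numpy as np

def separate_vocals(mixture):
    # Placeholder: a real system would run a trained separation model here.
    return mixture

def log_bands(signal, n_fft=1024, hop=512, n_bands=40):
    """Crude log filter-bank energies as a timbre feature (dependency-free)."""
    frames = [signal[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(signal) - n_fft, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Average power in log-spaced bands, a rough stand-in for a mel filter bank.
    edges = np.unique(np.geomspace(2, power.shape[1] - 1, n_bands + 1).astype(int))
    bands = np.array([power[:, a:b].mean(axis=1)
                      for a, b in zip(edges[:-1], edges[1:])]).T
    return np.log(bands + 1e-10)

sr = 16000
mixture = np.random.randn(3 * sr)      # stand-in for a 3 s music excerpt
vocals = separate_vocals(mixture)
features = log_bands(vocals)           # input for the singer-ID embedding
print(features.shape)                  # (frames, bands)
```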

Analysis of Voice Signal Features of Four Representative Korean Movie Actresses (대한민국 대표 여성 영화배우 4인의 음성적 특징 분석)

  • Kim, Bong-Hyun;Lee, Se-Hwan;Ka, Min-Kyoung;Cho, Dong-Uk
    • Proceedings of the KAIS Fall Conference
    • /
    • 2009.12a
    • /
    • pp.723-726
    • /
    • 2009
  • The movie industry draws much attention as a cultural industry that answers the demands of modern society and improves quality of life. In this industry, box-office success is regarded as very important because it is directly tied to the industry's development. Many factors serve as indicators of box-office success; among them, a match between the lead actor's characteristics and the essence of the film is one partial indicator of a successful film. In this paper, we therefore analyzed, through voice analysis of four actresses regarded as box-office magicians of Korean cinema, the correlation between such factors and a film's success. To this end, we analyzed the voices of the representative actresses Kim Hye-soo, Uhm Jung-hwa, Jeon Do-yeon, and Moon So-ri, and examined the correlation with the box-office results of the films in which they starred.

Design of ONU for EPON Based Access Network (EPON 액세스 망 기반의 ONU 설계)

  • 김용태;신동범;이형섭
    • Proceedings of the IEEK Conference
    • /
    • 2003.07a
    • /
    • pp.633-636
    • /
    • 2003
  • An Ethernet passive optical network (EPON) is a point-to-multipoint optical network. EPONs leverage the low cost and high performance curve of Ethernet systems to deliver reliable data, voice, and video to end users at bandwidths far exceeding current access technologies. In this paper, we propose an economical and flexible structure for an optical network unit (ONU) that converts optical-format traffic to the customer's desired format (Ethernet, VDSL, T1, IP multicast, etc.). A unique feature of EPONs is that, in addition to terminating and converting the optical signal, the ONU provides Layer 2-3 switching functionality, which allows internal routing of enterprise traffic at the ONU.

A Study on Speech Training Aids for the Deaf (청각장애자용 발음훈련기기 개발에 관한 연구)

  • Ahn, Sang-Pil;Lee, Jae-Hyuk;Yoon, Tae-Sung;Park, Sang-Hui
    • Proceedings of the KIEE Conference
    • /
    • 1990.07a
    • /
    • pp.47-50
    • /
    • 1990
  • Deaf people cannot speak as clearly as hearing people because they lack auditory feedback on their own pronunciation; speech training is therefore required. In this study, the fundamental frequency, intensity, formant frequencies, vocal-tract graphic, and vocal-tract area function extracted from the speech signal are used as feature parameters. An AR model, whose coefficients are extracted by inverse filtering, is used as the speech generation model. To connect the vocal-tract graphic with the speech parameters, articulation distances and articulation-distance functions over 15 selected intervals are determined from the extracted vocal-tract areas and formant frequencies.
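
A minimal sketch of formant estimation from an AR (LPC) model, matching the inverse-filtering view above: the LPC coefficients are found from the autocorrelation normal equations, and formant frequencies are read off the roots of the prediction polynomial. The model order and the synthetic vowel are illustrative assumptions.

```python
import numpy as np

def lpc(signal, order=10):
    """Autocorrelation-method LPC: returns A(z) = [1, -a1, ..., -ap]."""
    r = np.correlate(signal, signal, mode="full")[len(signal) - 1:][:order + 1]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate(([1.0], -a))

def formants(signal, sr, order=10):
    roots = np.roots(lpc(signal, order))
    # Keep one root per conjugate pair, near the unit circle (resonant poles).
    roots = roots[(np.imag(roots) > 0) & (np.abs(roots) > 0.9)]
    freqs = np.angle(roots) * sr / (2 * np.pi)
    return np.sort(freqs[freqs > 90])          # drop near-DC roots

# Toy usage: two damped resonances standing in for F1/F2 of a vowel.
sr = 8000
t = np.arange(1024) / sr
vowel = (np.exp(-40 * t) * np.sin(2 * np.pi * 700 * t)
         + 0.5 * np.exp(-60 * t) * np.sin(2 * np.pi * 1200 * t))
print(formants(vowel, sr))    # expect values near 700 and 1200 Hz
```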

Audio and Video Bimodal Emotion Recognition in Social Networks Based on Improved AlexNet Network and Attention Mechanism

  • Liu, Min;Tang, Jun
    • Journal of Information Processing Systems
    • /
    • v.17 no.4
    • /
    • pp.754-771
    • /
    • 2021
  • In continuous-dimension emotion recognition, the parts that highlight emotional expression differ across modalities, and the influence of each modality on the emotional state also differs. This paper therefore studies the fusion of the two most important modalities in emotion recognition, voice and facial expression, and proposes a dual-modal emotion recognition method that combines an improved AlexNet network with an attention mechanism. After simple preprocessing of the audio and video signals, audio features are first extracted using prior knowledge, and facial-expression features are then extracted by the improved AlexNet network. Finally, a multimodal attention mechanism fuses the facial-expression and audio features, and an improved loss function is used to address the modality-missing problem, improving the robustness of the model and the performance of emotion recognition. Experimental results show that the proposed model achieves concordance correlation coefficients (CCC) of 0.729 and 0.718 in the arousal and valence dimensions, respectively, superior to several comparison algorithms.
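
A minimal sketch of the evaluation metric quoted above, the concordance correlation coefficient used for continuous arousal/valence ratings; the toy predictions are illustrative, not the paper's data.

```python
import numpy as np

def ccc(pred, gold):
    """Concordance correlation coefficient between two rating sequences."""
    mp, mg = pred.mean(), gold.mean()
    vp, vg = pred.var(), gold.var()
    cov = ((pred - mp) * (gold - mg)).mean()
    return 2 * cov / (vp + vg + (mp - mg) ** 2)

rng = np.random.default_rng(0)
gold = rng.uniform(-1, 1, 200)                  # toy valence annotations
pred = 0.8 * gold + 0.1 * rng.normal(size=200)  # toy model outputs
print(round(ccc(pred, gold), 3))                # near 1 means good concordance
```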

Deep Learning based Raw Audio Signal Bandwidth Extension System (딥러닝 기반 음향 신호 대역 확장 시스템)

  • Kim, Yun-Su;Seok, Jong-Won
    • Journal of IKEEE
    • /
    • v.24 no.4
    • /
    • pp.1122-1128
    • /
    • 2020
  • Bandwidth extension refers to restoring and expanding a narrowband signal (NB) that has been degraded in the encoding and decoding process, whether by limited channel capacity or by the characteristics of the codec in a mobile communication device, into a wideband signal (WB). Bandwidth-extension research has mainly focused on voice signals, converting the high band into the frequency domain as in SBR (Spectral Band Replication) and IGF (Intelligent Gap Filling) and restoring the missing or damaged high band through complex feature-extraction processes. In this paper, we propose an autoencoder-based deep learning model that outputs a bandwidth-extended signal: using residual connections in one-dimensional convolutional neural networks (CNNs), the bandwidth is extended from a fixed-length time-domain input without complicated pre-processing. We also confirmed that the damaged high band can be restored even when training on a dataset containing various types of sound sources, including music, not limited to speech.
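
A minimal sketch of the model family described above: a 1-D convolutional autoencoder with residual connections mapping a fixed-length narrowband waveform to a bandwidth-extended one. The layer counts and channel sizes are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(ch, ch, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(ch, ch, kernel_size=9, padding=4))
    def forward(self, x):
        return torch.relu(x + self.conv(x))   # residual connection

class BWEAutoencoder(nn.Module):
    def __init__(self, ch=32, blocks=3):
        super().__init__()
        self.inp = nn.Conv1d(1, ch, kernel_size=9, padding=4)
        self.body = nn.Sequential(*[ResBlock(ch) for _ in range(blocks)])
        self.out = nn.Conv1d(ch, 1, kernel_size=9, padding=4)
    def forward(self, x):                     # x: (batch, 1, samples)
        return self.out(self.body(self.inp(x)))

# Usage: raw fixed-length time-domain input, no hand-crafted pre-processing.
model = BWEAutoencoder()
nb = torch.randn(4, 1, 8192)                  # batch of narrowband segments
wb = model(nb)                                # bandwidth-extended estimate
print(wb.shape)                               # torch.Size([4, 1, 8192])
```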

Robust Speech Segmentation Method in Noise Environment for Speech Recognizer (음성인식기 구현을 위한 잡음에 강인한 음성구간 검출기법)

  • 김창근;박정원;권호민;허강인
    • Journal of the Institute of Convergence Signal Processing
    • /
    • v.4 no.2
    • /
    • pp.18-24
    • /
    • 2003
  • One of the most important tasks in implementing a real-time speech recognizer is to design both a reliable VAD (Voice Activity Detection) and a suitable speech feature vector. However, because reliable VAD is difficult in environments with surrounding noise, a suitable speech feature vector may not be obtainable. To solve this problem, in this paper we compute not only the commonly used short-time power spectrum but also two additional parameters: a spectrum-density comparison measure that is robust to noise, and a linear discriminant function based on linear regression. VAD is then performed by combining the parameters with weights appropriate to the level of surrounding noise. Recognition experiments using DTW (Dynamic Time Warping) confirm that the proposed parameters remain robust in noisy environments.
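
A minimal sketch of the combination idea above: frame-wise scores from several parameters are fused with weights that depend on the estimated noise level. The individual measures and weights here are illustrative assumptions, not the paper's exact parameters.

```python
import numpy as np

def short_time_power(frames):
    return (frames ** 2).mean(axis=1)

def spectral_flatness(frames):
    """Low flatness suggests speech; high flatness suggests noise-like frames."""
    mag = np.abs(np.fft.rfft(frames, axis=1)) + 1e-10
    return np.exp(np.log(mag).mean(axis=1)) / mag.mean(axis=1)

def vad(frames, noise_level_db):
    p = short_time_power(frames)
    s1 = (p - p.min()) / (np.ptp(p) + 1e-10)      # power score in [0, 1]
    s2 = 1.0 - spectral_flatness(frames)          # tonality score in [0, 1]
    w = 0.8 if noise_level_db < -50 else 0.4      # trust power less in noise
    return (w * s1 + (1 - w) * s2) > 0.5          # True = speech frame

# Toy usage: 2 s of tone (speech stand-in) followed by 2 s of noise only.
sr, n = 8000, 256
t = np.arange(4 * sr) / sr
sig = np.sin(2 * np.pi * 200 * t) * (t < 2) + 0.05 * np.random.randn(len(t))
frames = sig[:len(sig) // n * n].reshape(-1, n)
print(vad(frames, noise_level_db=-30)[:10])
```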
