• Title/Summary/Keyword: Speech sound error

Search Results: 33

Speech/Music Signal Classification Based on Spectrum Flux and MFCC For Audio Coder (오디오 부호화기를 위한 스펙트럼 변화 및 MFCC 기반 음성/음악 신호 분류)

  • Sangkil Lee;In-Sung Lee
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology
    • /
    • v.16 no.5
    • /
    • pp.239-246
    • /
    • 2023
  • In this paper, we propose an open-loop algorithm that classifies speech and music signals for an audio coder using spectral flux parameters and Mel-Frequency Cepstral Coefficients (MFCC). The MFCC is used as a short-term feature to increase responsiveness, and the spectral flux is used as a long-term feature to improve accuracy. The overall speech/music classification decision combines the short-term and long-term classification methods. A Gaussian Mixture Model (GMM) is used for pattern recognition, and the optimal GMM parameters are estimated with the Expectation-Maximization (EM) algorithm. The proposed combined long-term and short-term classification method showed an average classification error rate of 1.5% on various audio sources, improving the error rate by 0.9% over the short-term method alone and by 0.6% over the long-term method alone. Compared with the Unified Speech and Audio Coding (USAC) classification method, the proposed method improved the classification error rate by 9.1% on percussive music signals with attacks and by 5.8% on speech signals.
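The two feature extractors named in this abstract can be sketched in a few lines. This is an illustrative pure-Python version (a naive DFT stands in for a real FFT library; the frame contents and sizes are hypothetical, not taken from the paper):

```python
import math

def magnitude_spectrum(frame):
    """Naive DFT magnitude spectrum of a real-valued frame (illustration only)."""
    n = len(frame)
    return [abs(sum(frame[t] * complex(math.cos(-2 * math.pi * k * t / n),
                                       math.sin(-2 * math.pi * k * t / n))
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def spectral_flux(prev_mag, cur_mag):
    """Sum of squared differences between successive magnitude spectra:
    low for spectrally steady music frames, high across rapid changes."""
    return sum((c - p) ** 2 for p, c in zip(prev_mag, cur_mag))
```

A steady signal yields zero flux between identical frames, while a change in spectral content yields positive flux, which is the long-term cue the classifier accumulates.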

Optimization of the Kernel Size in CNN Noise Attenuator (CNN 잡음 감쇠기에서 커널 사이즈의 최적화)

  • Lee, Haeng-Woo
    • The Journal of the Korea institute of electronic communication sciences
    • /
    • v.15 no.6
    • /
    • pp.987-994
    • /
    • 2020
  • In this paper, we studied the effect of the CNN kernel size on the performance of an acoustic noise attenuator. The system replaces the conventional adaptive filter with a deep-learning-based neural-network adaptive prediction filter. Speech is estimated from a single noisy input signal using a CNN with 100 neurons and 16 filters, trained with the error back-propagation algorithm. This exploits the quasi-periodic structure of the voiced segments of the speech signal. A simulation program using the Tensorflow and Keras libraries was written, and simulations were performed to verify the attenuator's performance as a function of kernel size. The results show that the MSE and MAE values are smallest at a kernel size of about 16 and increase for smaller or larger sizes, indicating that the features of a speech signal are best captured with a kernel size of about 16.
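The kernel-size trade-off discussed above can be illustrated with a plain 1-D "valid" convolution, a hypothetical pure-Python stand-in for one CNN layer (the paper's actual model is built with Tensorflow/Keras). A kernel of length k sees k consecutive samples per output, so its length sets the temporal span of the features it can capture:

```python
def conv1d(signal, kernel):
    """1-D 'valid' convolution (really cross-correlation, as in most
    deep-learning frameworks): slide the kernel over the signal."""
    n, k = len(signal), len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(n - k + 1)]
```

With a 100-sample input and a 16-tap kernel, each output value summarizes a 16-sample window, roughly the scale of one pitch period at the paper's reported optimum.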

A study on the simplification of HRTF within low frequency region (저역 주파수 영역에서 HRTF의 간략화에 관한 연구)

  • Lee, Chai-Bong
    • The Journal of the Korea institute of electronic communication sciences
    • /
    • v.5 no.6
    • /
    • pp.581-587
    • /
    • 2010
  • In this study, we investigated the effect of simplifying the low-frequency region of the Head-Related Transfer Function (HRTF) on sound localization. For this purpose, HRTFs were measured and analyzed. The standard deviation of the HRTF showed that the directional dependence of the low-frequency region was smaller than that of the high-frequency region, suggesting that the low-frequency region can be simplified. Simplification was performed by flattening the low-frequency amplitude characteristics with a high-pass filter whose cutoff is set to the boundary frequency. Auditory experiments evaluating the simplified HRTF showed that direction perception, measured by sound-localization error, was not influenced by the simplification, and that the front-back confusion rate was likewise unaffected by simplifying the HRTF frequency characteristics below 1 kHz. We therefore conclude that sound localization is not affected by simplifying the HRTF frequency characteristics below 1 kHz. This result is expected to help reduce the size of speech data with no deterioration of its directional characteristics.
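The flattening step can be sketched on a magnitude spectrum. This is an assumed, simplified version (replacing magnitudes below a boundary bin with their mean), not the paper's actual high-pass filter design; `cutoff_bin` is a hypothetical parameter corresponding to the boundary frequency:

```python
def flatten_low_band(mag, cutoff_bin):
    """Flatten an HRTF magnitude spectrum below cutoff_bin by replacing
    the low-frequency magnitudes with their mean level (a sketch of the
    low-band simplification; directional detail below the boundary
    frequency is discarded)."""
    low = mag[:cutoff_bin]
    level = sum(low) / len(low) if low else 0.0
    return [level] * len(low) + mag[cutoff_bin:]
```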

An efficient space dividing method for the two-dimensional sound source localization (2차원 상의 음원위치 추정을 위한 효율적인 영역분할방법)

  • Kim, Hwan-Yong;Choi, Hong-Sub
    • The Journal of the Acoustical Society of Korea
    • /
    • v.35 no.5
    • /
    • pp.358-367
    • /
    • 2016
  • SSL (Sound Source Localization) has been applied in man-machine interfaces, video conference systems, smart cars, and other applications. In sound source localization, however, angle estimation error arises mainly from the non-linear characteristics of the inverse sine function. An earlier approach reduced the effect of this non-linearity by dividing the space covered by the microphones into narrow regions. In this paper, we propose an optimal space dividing method according to the microphone array pattern, and estimate the sound source's two-dimensional position to evaluate the dividing method. In the experiments, the GCC-PHAT (Generalized Cross Correlation PHAse Transform) method, known to be robust in noisy environments, is adopted, and a triangular pattern of 3 microphones and a rectangular pattern of 4 microphones are each tested with 100 speech recordings. The results show that the triangular pattern cannot estimate the correct position because of its lower spatial resolution, while the rectangular pattern improves performance dramatically, with a correct estimation rate of 67%.
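The inverse-sine non-linearity mentioned above is easy to see numerically: the standard two-microphone DOA formula is θ = arcsin(c·τ/d), whose slope blows up as the argument approaches ±1, so the same TDOA error costs far more degrees near endfire than near broadside. A minimal sketch (the distances and the speed of sound are illustrative assumptions):

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def doa_angle(tdoa, mic_distance):
    """Direction-of-arrival angle (degrees) from a TDOA between two mics."""
    x = SPEED_OF_SOUND * tdoa / mic_distance
    x = max(-1.0, min(1.0, x))          # clamp against measurement noise
    return math.degrees(math.asin(x))

def angle_sensitivity(tdoa, mic_distance, dt=1e-7):
    """Degrees of angle change per second of TDOA error (finite difference);
    grows sharply as the source moves away from broadside."""
    return (doa_angle(tdoa + dt, mic_distance) - doa_angle(tdoa, mic_distance)) / dt
```

Dividing the space into narrow regions keeps each region's estimation near the flat part of its own arcsin curve, which is the motivation the abstract describes.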

A Reading Training Program Offering Visual-Auditory Cues with a Noise Cancellation Function (잡음제거 기능을 갖춘 시-청각 단서 제공 읽기 훈련 프로그램)

  • Bang, D.H.;Kang, H.D.;Kil, S.K.;Lee, S.M.
    • Journal of rehabilitation welfare engineering & assistive technology
    • /
    • v.2 no.1
    • /
    • pp.35-43
    • /
    • 2009
  • In this paper, we introduce a reading training (RT) program that offers visual-auditory cues with a noise cancellation function. The RT program provides training sentences with visual and auditory cues, which patients with motor speech disorders can use for reading training. To make evaluation of the training results convenient, we developed a noise cancellation algorithm that removes both background noise and the auditory cues, which are recorded together with the reading speech while the patient reads the sentences on a PC monitor. We also developed a function that finds the time at which the patient starts reading after seeing a sentence. Recorded speech was acquired from six people (three male, three female) in four noise environments (interior noise, white noise, car interior noise, and babble noise). We evaluated the error in the detected starting time between the original recorded speech and the processed speech, with and without the noise cancellation function. Noise cancellation improved the timing error by 4.847 ± 2.4235 ms. The developed RT program is expected to help patients with motor speech disorders in reading training and symptom evaluation.
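The abstract does not specify how the reading start time is detected; a common energy-threshold approach conveys the idea. The frame length and threshold below are assumed values, not the paper's:

```python
def onset_time(samples, rate, frame_len=160, threshold=0.01):
    """Return the start time (seconds) of the first frame whose mean
    energy exceeds the threshold, or None if no speech is detected.
    A sketch of energy-based speech-onset detection."""
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        if energy > threshold:
            return start / rate
    return None
```

Cleaner input helps such a detector directly, which is consistent with the reported reduction in timing error once the noise and auditory cues are removed.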


Acoustic and Physiologic Characteristics of Newborn Infants' Communication Intent via Crying (신생아 울음의 의사소통 의도와 관련된 음향학적 특성)

  • Jang, Hyo-Ryung;Ko, Do-Heung
    • Phonetics and Speech Sciences
    • /
    • v.5 no.3
    • /
    • pp.55-60
    • /
    • 2013
  • The purpose of this study was to investigate the acoustic characteristics of infant crying according to communication intent (hunger or pain) in terms of fundamental frequency ($F_0$), jitter, shimmer, noise-to-harmonic ratio (NHR), habitual pitch, and intensity. The subjects were 20 healthy, normal, full-term (38 to 42 weeks) infants from the city of Seoul, all less than seven days old. Crying was recorded for three minutes: crying due to pain was induced by the inborn metabolism error test, while crying due to hunger was verified by the rooting reflex while waiting for the designated feeding time. The results were as follows: (1) the fundamental frequency, NHR, and intensity of crying due to pain were significantly higher than those of crying due to hunger; (2) crying due to hunger and crying due to pain did not differ significantly in mean jitter and shimmer, but both were well outside the normal thresholds (jitter 1.04%, shimmer 3.81%). This study is significant in showing that the acoustic characteristics of infant crying from hunger and from pain differ clearly according to communication intent across the six acoustic parameters.
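Two of the measures above have simple standard definitions worth spelling out: local jitter is the mean absolute difference between consecutive pitch periods as a percentage of the mean period, and local shimmer is the same measure applied to cycle peak amplitudes. A minimal sketch (input lists of periods/amplitudes are hypothetical):

```python
def jitter_percent(periods):
    """Local jitter: mean absolute difference of consecutive pitch
    periods, as a percentage of the mean period."""
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    return 100.0 * (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

def shimmer_percent(amplitudes):
    """Local shimmer: the same measure applied to cycle peak amplitudes."""
    diffs = [abs(a - b) for a, b in zip(amplitudes, amplitudes[1:])]
    return 100.0 * (sum(diffs) / len(diffs)) / (sum(amplitudes) / len(amplitudes))
```

A perfectly periodic cry would score 0% on both; the values reported above (beyond 1.04% jitter and 3.81% shimmer) indicate strong cycle-to-cycle irregularity in both crying conditions.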

Direction-of-Arrival Estimation of Speech Signals Based on MUSIC and Reverberation Component Reduction (MUSIC 및 반향 성분 제거 기법을 이용한 음성신호의 입사각 추정)

  • Chang, Hyungwook;Jeong, Sangbae;Kim, Youngil
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.18 no.6
    • /
    • pp.1302-1309
    • /
    • 2014
  • In this paper, we propose a method to improve the performance of direction-of-arrival (DOA) estimation of a speech source using a multiple signal classification (MUSIC)-based algorithm. The proposed algorithm utilizes a complex-coefficient band-pass filter to generate the narrow-band signals for analysis. Reverberation component reduction and quadratic-function-based response approximation of the MUSIC spatial spectrum are also employed to improve the accuracy of the DOA estimate. Experimental results show that the proposed method outperforms the well-known generalized cross-correlation (GCC)-based DOA estimation algorithm in terms of both estimation error and success rate.
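The quadratic-function-based response approximation mentioned above is commonly realized as parabolic peak interpolation: fit a parabola through the spectrum sample at the peak and its two neighbors, and read off the fractional offset of the true maximum. A sketch (assuming equally spaced angle samples; this is a standard technique, not necessarily the paper's exact formulation):

```python
def quadratic_peak(y_left, y_peak, y_right):
    """Fit a parabola through three equally spaced samples around a
    spatial-spectrum peak and return the fractional offset of the true
    maximum relative to the middle sample, in [-0.5, 0.5]."""
    denom = y_left - 2.0 * y_peak + y_right
    if denom == 0.0:
        return 0.0  # flat triple: no refinement possible
    return 0.5 * (y_left - y_right) / denom
```

The refined DOA is then the peak's grid angle plus this offset times the grid spacing, giving sub-grid resolution without a finer scan.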

Generalized cross correlation with phase transform sound source localization combined with steered response power method (조정 응답 파워 방법과 결합된 generalized cross correlation with phase transform 음원 위치 추정)

  • Kim, Young-Joon;Oh, Min-Jae;Lee, In-Sung
    • The Journal of the Acoustical Society of Korea
    • /
    • v.36 no.5
    • /
    • pp.345-352
    • /
    • 2017
  • We propose a method that reduces the direction estimation error for a sound source in reverberant and noisy environments. The proposed algorithm divides the speech signal into voiced and unvoiced segments using voice activity detection (VAD), and the source direction is estimated only for voiced frames. The TDOA (Time Difference of Arrival) across the microphone array is estimated for each such frame with the GCC-PHAT (Generalized Cross Correlation with Phase Transform) method. To improve the accuracy of the source location, the cross-correlation peak of the two signals at the estimated time delay is then compared with the peaks at the other time delays in a time table. If the angle of the current frame differs greatly from those of the preceding and following frames within a run of voiced frames, it is replaced with the mean of the angles estimated in those neighboring frames.
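GCC-PHAT itself is compact enough to sketch: whiten the cross-spectrum to unit magnitude (keeping only phase), transform back, and take the lag of the peak. A self-contained pure-Python version using a naive DFT (a real implementation would use an FFT; signal lengths here are illustrative):

```python
import cmath

def _dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def _idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def gcc_phat_delay(x, y):
    """Estimate the delay of y relative to x in samples (positive when
    y is a delayed copy of x) using GCC-PHAT: normalize the cross-spectrum
    by its magnitude, then pick the peak of its inverse transform."""
    X, Y = _dft(x), _dft(y)
    cross = [b * a.conjugate() for a, b in zip(X, Y)]
    phat = [c / abs(c) if abs(c) > 1e-12 else 0j for c in cross]
    corr = [c.real for c in _idft(phat)]
    n = len(corr)
    lag = max(range(n), key=lambda i: corr[i])
    return lag if lag <= n // 2 else lag - n   # map upper lags to negative delays
```

The phase-transform weighting is what makes the peak sharp and robust in reverberation, which is why the paper adopts it before its table-based peak comparison.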

An Implementation of Security System Using Speaker Recognition Algorithm (화자인식 알고리즘을 이용한 보안 시스템 구축)

  • Shin, You-Shik;Park, Kee-Young;Kim, Chong-Kyo
    • Journal of the Korean Institute of Telematics and Electronics T
    • /
    • v.36T no.4
    • /
    • pp.17-23
    • /
    • 1999
  • This paper describes a security system using a text-independent speaker recognition algorithm. The system is based on a PIC16F84 and a sound card. The speaker recognition algorithm applies a k-means-based model with weighted cepstrum speech features. Experimental results show a recognition rate of 100% for training data and 99% for non-training data. For the five registered speakers, the false rejection rate is 1%, the false acceptance rate is 0%, and the mean verification error rate is 0.5%.
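In a k-means-based speaker model of this kind, each enrolled speaker's cepstral feature vectors are clustered into a small codebook of centroids, and a test utterance is scored by its average distortion against the claimed speaker's codebook. A minimal sketch under that assumption (feature vectors and cluster counts are illustrative; the paper's weighting of the cepstrum is not reproduced here):

```python
import random

def _dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means: returns k centroid vectors (a speaker's codebook)."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            i = min(range(k), key=lambda c: _dist2(v, centroids[c]))
            clusters[i].append(v)
        for i, cl in enumerate(clusters):
            if cl:  # keep the old centroid if a cluster empties out
                centroids[i] = [sum(col) / len(cl) for col in zip(*cl)]
    return centroids

def codebook_distortion(vectors, centroids):
    """Mean squared distance of feature vectors to their nearest codebook
    entry; a claimed speaker with low distortion is accepted."""
    return sum(min(_dist2(v, c) for c in centroids) for v in vectors) / len(vectors)
```

Verification then reduces to comparing this distortion against a threshold, which is where the false rejection/acceptance trade-off reported above comes from.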


Real-time Implementation of the AMR Speech Coder Using OakDSPCore® (OakDSPCore®를 이용한 적응형 다중 비트 (AMR) 음성 부호화기의 실시간 구현)

  • 이남일;손창용;이동원;강상원
    • The Journal of the Acoustical Society of Korea
    • /
    • v.20 no.6
    • /
    • pp.34-39
    • /
    • 2001
  • The adaptive multi-rate (AMR) speech coder was adopted as a standard for W-CDMA by 3GPP and ETSI. The AMR coder is based on the CELP algorithm, operates at rates from 12.2 kbps down to 4.75 kbps, and is source-controlled according to channel error conditions and traffic load. In this paper, we implement DSP software for the AMR coder on the OakDSPCore. The implementation is based on the CSD17C00A chip developed by C&S Technology and was verified for bit-exactness using the AMR speech codec test vectors provided by ETSI. The DSP software requires 20.6 MIPS for the encoder and 2.7 MIPS for the decoder. The memory required by the AMR coder is 21.97 kwords for code, 6.64 kwords for data sections, and 15.1 kwords for data ROM. An actual sound input/output test using a microphone and speaker demonstrated proper real-time operation without distortion or delay.
