• Title/Summary/Keyword: speech rates

Eigenvoice Adaptation of Classification Model for Binary Mask Estimation (Eigenvoice를 이용한 이진 마스크 분류 모델 적응 방법)

  • Kim, Gibak
    • Journal of Broadcast Engineering / v.20 no.1 / pp.164-170 / 2015
  • This paper addresses the adaptation of the classification model in the binary mask approach to noise suppression in noisy environments. The binary mask estimation approach is known to improve the intelligibility of noisy speech. However, the training data used to build the classification model for binary mask estimation must include the same type of noise as the test data. Eigenvoice adaptation is applied to a noise-independent classification model, and the adapted model is used as a noise-dependent model. The results are reported as hit rates and false alarm rates. The experiments confirmed that classification accuracy improves as the number of adaptation sentences increases.
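
The abstract reports results as hit rates and false alarm rates but does not define them. The sketch below assumes the definition commonly used for binary mask evaluation: the hit rate is the fraction of target-dominated units (1 in the reference mask) that the estimate also labels 1, and the false alarm rate is the fraction of noise-dominated units (0 in the reference) that the estimate wrongly labels 1. The function name `hit_fa_rates` and the toy masks are hypothetical and are not taken from the paper.

```python
import numpy as np

def hit_fa_rates(estimated_mask, ideal_mask):
    """Return (hit rate, false alarm rate) for two binary masks of equal shape.

    Assumed definitions (not stated in the abstract):
      hit rate         = correctly estimated 1s / all 1s in the reference mask
      false alarm rate = wrongly estimated 1s   / all 0s in the reference mask
    """
    est = np.asarray(estimated_mask, dtype=bool)
    ref = np.asarray(ideal_mask, dtype=bool)
    if est.shape != ref.shape:
        raise ValueError("masks must have the same shape")
    n_target = ref.sum()      # units labelled 1 in the reference
    n_masker = (~ref).sum()   # units labelled 0 in the reference
    hit = (est & ref).sum() / n_target if n_target else float("nan")
    fa = (est & ~ref).sum() / n_masker if n_masker else float("nan")
    return float(hit), float(fa)

# Toy 2 x 4 time-frequency masks, for illustration only.
ideal = np.array([[1, 1, 0, 0],
                  [1, 0, 0, 1]])
estimated = np.array([[1, 0, 0, 1],
                      [1, 0, 0, 1]])
hit, fa = hit_fa_rates(estimated, ideal)
print(f"HIT = {hit:.2f}, FA = {fa:.2f}")
```

A classifier that labels every unit 1 gets a perfect hit rate, so the two rates are usually read together.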

A Temporal Decomposition Method Based on a Rate-distortion Criterion (비트율-왜곡 기반 음성 신호 시간축 분할)

  • 이기승
    • The Journal of the Acoustical Society of Korea / v.21 no.3 / pp.315-322 / 2002
  • In this paper, a new temporal decomposition method is proposed, which takes into consideration not only spectral distortion but also bit rate. The interpolation functions, which are among the parameters required for temporal decomposition, are obtained from a training speech corpus. Since the interval between two targets uniquely defines the interpolation function, the interpolation can be represented without additional information. The locations of the targets are determined by minimizing the bit rate while the maximum spectral distortion remains below a given threshold. The proposed method has been applied to compressing the LSP coefficients, which are widely used as spectral parameters. The simulation results show that an average spectral distortion of about 1.4 dB can be achieved at an average bit rate of about 8 bits/frame.

A study on deep neural speech enhancement in drone noise environment (드론 소음 환경에서 심층 신경망 기반 음성 향상 기법 적용에 관한 연구)

  • Kim, Jimin;Jung, Jaehee;Yeo, Chaneun;Kim, Wooil
    • The Journal of the Acoustical Society of Korea / v.41 no.3 / pp.342-350 / 2022
  • In this paper, actual drone noise samples are collected for speech processing in disaster environments to build a noise-corrupted speech database, and speech enhancement performance is evaluated by applying spectral subtraction and mask-based speech enhancement techniques. To improve the performance of VoiceFilter (VF), an existing deep neural network-based speech enhancement model, we apply the Self-Attention operation and use the estimated noise information as input to the Attention model. Compared to the existing VF model, the experimental results show improvements of 3.77%, 1.66% and 0.32% in Source to Distortion Ratio (SDR), Perceptual Evaluation of Speech Quality (PESQ), and Short-Time Objective Intelligibility (STOI), respectively. When trained with a 75% mix of speech data with drone sounds collected from the Internet, the relative performance drops for SDR, PESQ, and STOI are 3.18%, 2.79% and 0.96%, respectively, compared to using only actual drone noise. This confirms that data similar to real data can be collected and used effectively to train speech enhancement models for environments where real data is difficult to obtain.

Automatic speech recognition using acoustic doppler signal (초음파 도플러를 이용한 음성 인식)

  • Lee, Ki-Seung
    • The Journal of the Acoustical Society of Korea / v.35 no.1 / pp.74-82 / 2016
  • In this paper, a new automatic speech recognition (ASR) method is proposed in which ultrasonic Doppler signals are used instead of conventional speech signals. Compared with conventional speech-based ASR, the proposed method offers robustness against acoustic noise and the user comfort of a non-contact sensor. In the proposed method, a 40 kHz ultrasonic signal is radiated toward the mouth and the reflected ultrasonic signals are received. The frequency shift caused by the Doppler effect is used to implement ASR. The proposed method employs multi-channel ultrasonic signals acquired from various locations, unlike the previous method, in which a single-channel ultrasonic signal was employed. PCA (Principal Component Analysis) coefficients are used as the ASR features, and a left-right hidden Markov model (HMM) is adopted. To verify the feasibility of the proposed ASR, a speech recognition experiment was carried out on 60 Korean isolated words obtained from six speakers. The experimental results showed that the overall word recognition rates were comparable with those of conventional speech-based ASR methods and that the proposed method outperformed the conventional single-channel ASR method. In particular, an average recognition rate of 90% was maintained in noisy environments.

Voice Onset Time of Korean Stops as a Function of Speaking Rate (발화 속도에 따른 한국어 폐쇄음의 VOT 값 변화)

  • Oh, Eun-Jin
    • Phonetics and Speech Sciences / v.1 no.3 / pp.39-48 / 2009
  • Previous studies on the effects of speaking rate on voice onset time (VOT) of stops in English, French, Icelandic, and Thai indicate that speaking rate asymmetrically affects VOT values. That is, pre-voiced and long-lag stops vary due to the rate factor more than short-lag stops do. One suggested explanation for this asymmetry is that it is due to the necessity of maintaining phonetic contrasts among the stop categories. Since pre-voiced and long-lag stops represent the ends of the VOT scale, they encompass broad swathes of that range and consequently allow for large variations. On the other hand, the VOT variations of short-lag stops may result in overlap with the VOTs of long-lag stops. This study aimed to explore the effects of speaking rate on the VOTs of Korean stops and see whether Korean fortis and lenis stops are limited in the degrees of variation as a function of rates due to the existence of stops with larger VOT values, lenis and aspirated stops respectively. Conversely, aspirated stops were expected to show more variation since there are no other categories with longer VOTs. Fortis, lenis, and aspirated stops in /CVn/ words (C = bilabial or velar stop, V = /i/ or /a/) were examined in isolation, and at normal and fast rates in a carrier sentence. Speaking rates were controlled by alternating words or sentences on a computer screen at intervals of two seconds for the isolation- and normal-rate conditions and one second for the fast-rate condition. This study found that while the VOTs of fortis stops did not change significantly, those of lenis and aspirated stops showed considerable changes as a function of speaking rates. Also, overlap between lenis and aspirated stops occurred considerably at all speaking rates. These phenomena were interpreted to relate to the fact that VOT contrasts between lenis and aspirated stops in Korean are currently being collapsed. 
Large variations of lenis stops as a function of rate seem to occur because there is only weak motivation to limit the degree of variation for the purpose of maintaining phonetic contrasts. The significant overlap between lenis and aspirated stops at all rates was interpreted to occur because the VOT merger between the two categories has become largely established. In addition, the percentage of VOTs correctly classified by optimal boundary values between lenis and aspirated stops turned out to be lower than in previously studied languages. This was interpreted as further evidence that VOTs are losing their role in contrasting the two stop categories in Korean.

Real-time Implementation of AMR-WB Speech Codec Using TeakLite DSP (TeakLite DSP를 이용한 적응형 다중 비트율 광대역 (AMR-WB) 음성부호화기의 실시간 구현)

  • 정희범;김경수;한민수;변경진
    • The Journal of the Acoustical Society of Korea / v.23 no.3 / pp.262-267 / 2004
  • The AMR-WB (Adaptive Multi-Rate Wideband) speech codec, the most recent voice codec standardized by 3GPP, has a wider audio bandwidth of 50-7000 Hz and operates at nine speech coding bit rates between 6.60 and 23.85 kbit/s. This paper presents a real-time implementation of the AMR-WB speech codec on a 16-bit fixed-point TeakLite DSP. The implemented AMR-WB codec requires 52.2 MIPS in the 23.85 kbit/s mode, along with 17.9 kwords of program memory, 11.8 kwords of data RAM, and 10.1 kwords of data ROM. The implementation was verified by passing all the test vectors provided by 3GPP while maintaining bit exactness. Stable operation on a real-time test board was also confirmed, with no distortion or delay in the audio input and output.

A Study on Connected Digits Recognition Using the K-L Expansion (K-L 전개를 이용한 연속 숫자음 인식에 관한 연구)

  • 김주곤;오세진;황철준;김범국;정현열
    • Journal of the Institute of Convergence Signal Processing / v.2 no.3 / pp.24-31 / 2001
  • The K-L expansion is a method for reducing the dimensionality of features and thus reduces the computational cost of the recognition process. It is also well known in statistical pattern recognition that features can be extracted this way without much loss of information. In this paper, a method that effectively applies the K-L (Karhunen-Loeve) expansion to speech feature parameters is proposed to improve the recognition accuracy of a Korean speech recognition system. The recognition performance of the novel feature parameters obtained by the proposed method (K-L coefficients) is compared with that of conventional Mel-cepstrum and regression coefficients through speaker-independent connected digit recognition experiments. The experimental results showed that the K-L coefficients with regression coefficients achieved higher average recognition rates than the conventional Mel-cepstrum with its regression coefficients.
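
The dimensionality reduction described in the abstract can be sketched as follows. This is a generic K-L (principal component) projection, not the paper's specific procedure: the 12-dimensional random frames, the choice of 4 retained components, and the helper names `kl_basis` and `kl_project` are all assumptions made for illustration.

```python
import numpy as np

def kl_basis(features, n_components):
    """Estimate a K-L basis from feature vectors (one row per frame).

    Returns the feature mean and the eigenvectors of the sample covariance
    matrix that correspond to the largest eigenvalues.
    """
    x = np.asarray(features, dtype=float)
    mean = x.mean(axis=0)
    cov = np.cov(x - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues ascending
    order = np.argsort(eigvals)[::-1][:n_components]  # keep the largest ones
    return mean, eigvecs[:, order]

def kl_project(features, mean, basis):
    """Project feature vectors onto the K-L basis to get K-L coefficients."""
    return (np.asarray(features, dtype=float) - mean) @ basis

# Hypothetical stand-in for speech features: 200 frames of 12 dimensions.
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 12))
mean, basis = kl_basis(frames, n_components=4)
coeffs = kl_project(frames, mean, basis)
print(coeffs.shape)
```

Keeping only the leading eigenvectors is what lowers the feature dimension and, with it, the cost of the later recognition steps.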

The Role of Post-lexical Intonational Patterns in Korean Word Segmentation

  • Kim, Sa-Hyang
    • Speech Sciences / v.14 no.1 / pp.37-62 / 2007
  • The current study examines the role of post-lexical tonal patterns of a prosodic phrase in word segmentation. In a word-spotting experiment, native Korean listeners were asked to spot a disyllabic or trisyllabic word in a twelve-syllable speech stream composed of three Accentual Phrases (AP). Words occurred with various post-lexical intonation patterns. The results showed that listeners spotted more words in phrase-initial than in phrase-medial position, suggesting that the AP-final H tone from the preceding AP helped listeners to segment the phrase-initial word in the target AP. Results also showed that listeners' error rates were significantly lower when words occurred with an initial rising tonal pattern, which is the most frequent intonational pattern imposed upon multisyllabic words in Korean, than with non-rising patterns. This result was observed in both AP-initial and AP-medial positions, regardless of the frequency and legality of the overall AP tonal patterns. Tonal cues other than the initial rising tone did not positively influence the error rate. These results not only indicate that a rising tone in AP-initial and AP-final position is a reliable cue for word boundary detection for Korean listeners, but further suggest that phrasal intonation contours serve as a possible word boundary cue in languages without lexical prominence.

Vowel Recognition Using the Fractal Dimension (프랙탈 차원을 이용한 모음인식)

  • 최철영;김형순;김재호;손경식
    • The Journal of Korean Institute of Communications and Information Sciences / v.19 no.6 / pp.1140-1148 / 1994
  • In this paper, we carried out experiments on Korean vowel recognition using the fractal dimension of speech signals. We chose the Minkowski-Bouligand dimension as the fractal dimension and computed it using the morphological covering method. In our experiments, we used both the fractal dimension and the LPC cepstrum, which is conventionally known to be one of the best parameters for speech recognition, and examined the usefulness of the fractal dimension. In vowel recognition experiments under various consonant contexts, we achieved vowel recognition error rates of 5.6% with the LPC cepstrum alone and 3.2% with both the LPC cepstrum and the fractal dimension. The results indicate that combining the fractal dimension with the LPC cepstrum reduces recognition errors by more than 40%, and that the fractal dimension is a useful feature parameter for speech recognition.
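
The "more than 40% reduction" can be checked from the two error rates quoted in the abstract. The short calculation below assumes the figure is a relative reduction, that is, the drop in error rate divided by the baseline error rate. The function name is illustrative.

```python
def relative_error_reduction(baseline_error, new_error):
    """Fraction of the baseline error removed by the new system."""
    return (baseline_error - new_error) / baseline_error

# Error rates from the abstract: 5.6% (LPC cepstrum only) and
# 3.2% (LPC cepstrum plus fractal dimension).
reduction = relative_error_reduction(5.6, 3.2)
print(f"{reduction:.1%}")  # prints 42.9%
```

The absolute difference is only 2.4 percentage points. The 40% figure is reached because that difference is measured against the 5.6% baseline.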

A Study on the Aerodynamic and Acoustic Characteristics in Dysarthria Speakers' Diadochokinesis by Articulation Valves in Vocal Tract (마비성구어장애 화자의 조음밸브 교호운동에 관한 공기역학 및 음향학적 특징)

  • Park, Hee-June;Kwon, Soon-Bok;Wang, Soo-Geun;Jeong, Ok-Ran
    • Speech Sciences / v.15 no.2 / pp.177-189 / 2008
  • This study investigated the diadochokinetic (DDK) rate, regularity and mean flow rate of the articulation valves in dysarthria. The DDK rate, mean airflow rate (MFR) and regularity of DDK syllable repetitions for vocal function /ihi/, tongue function /ta/, velopharyngeal function /bm/, and labial function /pa/ were measured in 24 normal and dysarthric speakers. Aerophone Ⅱ and Motor Speech Profile were used for data recording and analysis. The findings were as follows. First, there were significant differences between the dysarthria and the normal group in DDK rate. DDK rates were lowest in ataxic dysarthria, followed in sequence by spastic, flaccid, and hypokinetic dysarthria. Second, there was a significant difference between the dysarthria and the normal group in DDK regularity. Third, there was a significant difference between the dysarthria groups and the normal group in DDK MFR. Finally, there was a significant difference between the four dysarthria groups and the normal group in DDK airflow tracking. The results of this study can serve as guidelines for normal DDK rate, regularity and flow rate in dysarthria groups. In addition, differential diagnosis and description of these groups are important for deciding on the medical and behavioral management of individuals with these disorders according to their DDK characteristics.
