• Title/Summary/Keyword: speech speed

FPGA-Based Hardware Accelerator for Feature Extraction in Automatic Speech Recognition

  • Choo, Chang;Chang, Young-Uk;Moon, Il-Young
    • Journal of information and communication convergence engineering / v.13 no.3 / pp.145-151 / 2015
  • We describe in this paper a hardware-based scheme for improving the speed of a real-time automatic speech recognition (ASR) system by designing a parallel feature extraction algorithm on a Field-Programmable Gate Array (FPGA). A computationally intensive block in the algorithm is identified and implemented in hardware logic on the FPGA. One such block is the mel-frequency cepstrum coefficient (MFCC) algorithm used in the feature extraction process. We demonstrate that the FPGA platform can perform the feature extraction computation of the speech recognition system more efficiently than general-purpose CPUs, including the ARM processor. The Xilinx Zynq-7000 System on Chip (SoC) platform is used for the MFCC implementation. From the implementation described in this paper, we confirmed that the FPGA platform is approximately 500× faster than a sequential CPU implementation and 60× faster than a sequential ARM implementation. We thus verified that a parallelized and optimized MFCC architecture on the FPGA platform can significantly improve the execution time of an ASR system compared to the CPU and ARM platforms.
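The MFCC block that the paper moves into FPGA logic follows a standard pipeline: pre-emphasis, windowing, FFT power spectrum, mel filterbank, log, and DCT. A minimal single-frame sketch in NumPy is shown below; the frame size, filter counts, and coefficients are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, n_mels=26, n_ceps=13):
    """Minimal single-frame MFCC (the block parallelized on the FPGA)."""
    # 1. Pre-emphasis to boost high frequencies
    emph = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. Window one frame and take the power spectrum
    frame = emph[:n_fft] * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(frame)) ** 2 / n_fft
    # 3. Mel filterbank: triangular filters evenly spaced on the mel scale
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 4. Log filterbank energies, then DCT-II to decorrelate (cepstral step)
    logE = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return dct @ logE

coeffs = mfcc(np.random.randn(512))
print(coeffs.shape)  # (13,)
```

Each of these steps operates independently per frame (and the filterbank per filter), which is what makes the block amenable to the parallel FPGA mapping the paper describes.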

A comparison between affective prosodic characteristics observed in children with cochlear implant and normal hearing (인공와우 이식 아동과 정상 청력 아동의 정서적 운율 특성 비교)

  • Oh, Yeong Geon;Seong, Cheoljae
    • Phonetics and Speech Sciences / v.8 no.3 / pp.67-78 / 2016
  • This study examined the affective prosodic characteristics of children with cochlear implants (CI, hereafter) and children with normal hearing (NH, hereafter), along with listeners' perception of them. Speech samples were acquired from 15 NH and 15 CI children. Eight SLPs (Speech-Language Pathologists) perceptually evaluated the affective types using Praat's ExperimentMFC. Acoustically, there were statistically significant differences between the two groups across affective types [joy (discriminated by intensity deviation), anger (dominantly by intensity-related variables and partly by duration-related variables), and sadness (by all aspects of the prosodic variables)]. Compared with the NH children, the CI children were much louder when expressing joy, louder and slower when expressing anger, and higher, louder, and slower when expressing sadness. The listeners showed much higher agreement when evaluating the NH children than the CI group (p<.001). Chi-square results revealed that listeners were consistent for the NH children's utterances but not for the CI children's (CI (p<.01), NH (p=.48)). When the CI utterances were discriminated into the three emotional types by DA (Discriminant Analysis) using the eight acoustic variables, speed-related variables such as articulation rate played the primary role.

Support Vector Machine Based Phoneme Segmentation for Lip Synch Application

  • Lee, Kun-Young;Ko, Han-Seok
    • Speech Sciences / v.11 no.2 / pp.193-210 / 2004
  • In this paper, we develop a real-time lip-synch system that animates a 2-D avatar's lip motion in synch with an incoming speech utterance. To realize real-time operation, we bound the processing time by invoking merge and split procedures that perform coarse-to-fine phoneme classification. At each stage of phoneme classification, we apply the support vector machine (SVM) to reduce the computational load while retaining the desired accuracy. The coarse-to-fine phoneme classification is accomplished via two stages of feature extraction: first, each speech frame is acoustically analyzed into 3 classes of lip opening using Mel-Frequency Cepstral Coefficients (MFCC) as features; second, each frame's classification is further refined into detailed lip shapes using formant information. We implemented the system with 2-D lip animation, which shows the effectiveness of the proposed two-stage procedure in accomplishing a real-time lip-synch task. The method using phoneme merging and SVM achieved about twice the recognition speed of a method employing the Hidden Markov Model (HMM). The typical latency per frame observed for our method was on the order of 18.22 milliseconds, while an HMM method applied under identical conditions resulted in about 30.67 milliseconds.
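The coarse-to-fine idea above — a cheap 3-class lip-opening decision from MFCCs, then refinement from formants only within the chosen class — can be sketched with stand-in classifiers. Nearest-centroid substitutes here for the paper's SVMs to keep the sketch self-contained, and all class names and centroid values are hypothetical:

```python
import numpy as np

def nearest(x, centroids):
    """Stand-in for an SVM decision: label of the closest centroid."""
    return min(centroids, key=lambda k: np.linalg.norm(x - centroids[k]))

# Stage 1: three coarse lip-opening classes from MFCC-like features (2-D toy space)
coarse_centroids = {"closed": np.array([0.0, 0.0]),
                    "half":   np.array([1.0, 1.0]),
                    "open":   np.array([2.0, 2.0])}

# Stage 2: per-class refinement from formant-like features (1-D toy space);
# only the centroids inside the chosen coarse class are ever evaluated.
fine_centroids = {"closed": {"bilabial": np.array([0.2]),
                             "labiodental": np.array([0.8])},
                  "half":   {"mid_front": np.array([0.3]),
                             "mid_back": np.array([0.7])},
                  "open":   {"low": np.array([0.5])}}

def classify_frame(mfcc_vec, formant_vec):
    coarse = nearest(mfcc_vec, coarse_centroids)         # cheap first pass
    fine = nearest(formant_vec, fine_centroids[coarse])  # refine within class
    return coarse, fine

print(classify_frame(np.array([0.1, -0.1]), np.array([0.75])))
# ('closed', 'labiodental')
```

The complexity saving is structural: the fine stage never compares against lip shapes outside the coarse class, which is what keeps the per-frame latency bounded.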

A study on Variable Step Size algorithms for Convergence Speed Improvement of Frequency-Domain Adaptive Filter (주파수영역 적응필터의 수렴속도 향상을 위한 가변스텝사이즈 알고리즘에 관한 연구)

  • 정희준;오신범;이채욱
    • Proceedings of the IEEK Conference / 2000.11d / pp.191-194 / 2000
  • The frequency-domain adaptive filter is effective in communication applications with heavy computational requirements. In this paper, we propose new variable step size algorithms that improve the convergence speed and reduce the computational complexity of the frequency-domain adaptive filter. We compared the MSE of the proposed algorithms with that of the normalized FLMS using a computer simulation of an adaptive noise canceler based on synthesized speech.
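A variable step size adaptive filter of the kind the abstract describes can be illustrated in the time domain (the paper works in the frequency domain, so this is an analogy, not the proposed algorithm). The sketch identifies a toy 3-tap system with a normalized LMS update whose step size grows with the error energy and decays as the filter converges; all constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
h = np.array([0.5, -0.3, 0.2])             # unknown 3-tap system to identify
x = rng.standard_normal(2000)              # excitation signal
d = np.convolve(x, h)[:len(x)]             # desired signal = system output

w = np.zeros(3)                            # adaptive filter taps
mu, mu_min, mu_max, alpha = 0.5, 0.01, 0.5, 0.97
for n in range(2, len(x)):
    u = x[n-2:n+1][::-1]                   # tap vector [x[n], x[n-1], x[n-2]]
    e = d[n] - w @ u                       # a-priori error
    # Variable step size: large error -> larger step; decays as we converge
    mu = min(mu_max, max(mu_min, alpha * mu + (1 - alpha) * e * e))
    w += mu * e * u / (u @ u + 1e-8)       # normalized LMS update

print(np.round(w, 2))                      # taps converge close to h
```

The same trade-off drives the paper's design: a large step early for fast convergence, a small step later for low steady-state MSE, without the cost of re-tuning a fixed step size.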

Decision of the Korean Speech Act using Feature Selection Method (자질 선택 기법을 이용한 한국어 화행 결정)

  • 김경선;서정연
    • Journal of KIISE:Software and Applications / v.30 no.3_4 / pp.278-284 / 2003
  • A speech act is the speaker's intention indicated through an utterance. It is important for understanding natural-language dialogues and generating responses. This paper proposes a two-stage method that increases the performance of Korean speech act decision. The first stage selects features from the part-of-speech results of the sentence and from the context given by previous speech acts. For selecting features we use the χ² statistic (CHI), which has shown high performance in text categorization. The second stage determines the speech act with the selected features and a neural network. The proposed method shows the possibility of automatic speech act decision using only POS results, achieves good performance by using the more informative features, and speeds up the decision by decreasing the number of features. We tested the proposed method on a Korean dialogue corpus transcribed from recordings in real fields; this corpus consists of 10,285 utterances and 17 speech acts. We trained on 8,349 utterances, tested on 1,936 utterances, and obtained the correct speech act for 1,709 utterances (88.3%). This accuracy is about 8% higher than without feature selection.
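The first-stage χ² (CHI) feature scoring used for selection can be sketched on a toy speech-act corpus. The features, labels, and top-k cutoff below are hypothetical illustrations, not the paper's data:

```python
# Toy corpus: (feature set, speech-act label) per utterance. CHI scores how
# strongly a feature's presence co-occurs with a class, as in text categorization.
data = [({"please", "verb"}, "request"),
        ({"what", "verb"}, "question"),
        ({"please", "noun"}, "request"),
        ({"what", "noun"}, "question"),
        ({"verb"}, "inform")]

def chi2(feature, label):
    N = len(data)
    A = sum(1 for f, l in data if feature in f and l == label)      # present, class
    B = sum(1 for f, l in data if feature in f and l != label)      # present, other
    C = sum(1 for f, l in data if feature not in f and l == label)  # absent, class
    D = N - A - B - C                                               # absent, other
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

# Rank features by their best per-class CHI score; keep only the top-k
features = {f for fs, _ in data for f in fs}
labels = {l for _, l in data}
scores = {f: max(chi2(f, l) for l in labels) for f in features}
top = sorted(scores, key=scores.get, reverse=True)[:2]
print(top)
```

Here "please" and "what" score highest because each appears in exactly one class, while generic POS features like "verb" score low; dropping low-CHI features is what shrinks the neural network's input and speeds up the decision.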

Efficient Speech Enhancement based on left-right HMM with State Sequence Decision Using LRT (좌-우향 은닉 마코프 모델에서 상태결정을 이용한 음질향상)

  • 이기용
    • The Journal of the Acoustical Society of Korea / v.23 no.1 / pp.47-53 / 2004
  • We propose a new speech enhancement algorithm based on a left-right Hidden Markov Model (HMM) with state decision using a Log-likelihood Ratio Test (LRT). Since conventional HMM-based speech enhancement methods try to improve speech quality over all states, they incur huge computational loads inappropriate for real-time implementation. In a left-right HMM, only the current and the next state are considered as possible state transitions, which reduces the computational complexity. In this paper, we propose a method that decides the current state by applying the LRT to the previous state. Experimental results show that the proposed method improves the speed by up to 60% with only 0.2∼0.4 dB degradation of speech quality compared to the conventional method.
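The state-decision idea — in a left-right HMM only "stay" or "advance" is possible, so a log-likelihood ratio between the two candidate states settles the transition per frame — can be sketched with 1-D Gaussian emissions. The means, variance, and threshold here are illustrative assumptions, not the paper's models:

```python
import math

def log_gauss(x, mean, var):
    """Log-likelihood of x under a 1-D Gaussian emission."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

means = [0.0, 2.0, 4.0]   # toy per-state emission means of a 3-state model
var = 0.5

def next_state(state, obs, threshold=0.0):
    if state == len(means) - 1:
        return state                      # final state: nowhere left to go
    # LRT: compare only the two reachable states, not all of them
    lrt = log_gauss(obs, means[state + 1], var) - log_gauss(obs, means[state], var)
    return state + 1 if lrt > threshold else state

state = 0
for obs in [0.1, 0.2, 1.8, 2.1, 3.9]:     # observations drifting upward
    state = next_state(state, obs)
print(state)  # 2
```

Because each frame evaluates at most two emission likelihoods instead of all states, the per-frame cost is constant in the number of states, which is the source of the speed-up the abstract reports.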

A Comparative Study of Vocal Fold Vibratory Behaviors Shown in the Phonation of the /i/ Vowel between Persons who Stutter and Persons with Muscle Tension Dysphonia Using High-Speed Digital Imaging (초고속 성대촬영기(High-Speed Digital Imaging)를 이용한 말더듬인과 근 긴장성 발성장애인의 /이/모음 발성 시 성대 진동 양상에 관한 비교 연구)

  • Jung, Hun;Ahn, Jong-Bok;Park, Jin-Hyaung;Choi, Byung-Heun;Kwon, Do-Ha
    • Phonetics and Speech Sciences / v.1 no.4 / pp.195-201 / 2009
  • The purpose of this study was to use high-speed digital imaging (HSDI) to compare the vocal fold vibratory behaviors of persons who stutter (PWS) and persons with muscle tension dysphonia (PMTD) while uttering the /i/ vowel, in order to identify the characteristics of the vocal fold vibratory behaviors of PWS. This study examined seven developmental PWS and seven PMTD. The findings were as follows. First, regarding the two groups' vocal fold vibratory behaviors, three of the seven PWS were found to be close vocal tract (VC) and four combination vocal tract (VCB); of the seven PMTD, one was found to be VC and the other six VCB. These results indicate that both groups showed a voiceprint different from the open vocal tract (VO) found in the normal groups of the research by Jung et al. (2008b), and that the two groups also differed from each other in the voiceprint before vocalization. Second, a VKG analysis was conducted to identify the two groups' vocal cord contact quotient. The PWS group's vocal cord contact quotient changed gradually from irregular at the initial vocalization stage to regular, whereas the PMTD group maintained tension from the initial vocalization. Taken together, these results show a difference in vocal fold vibratory behaviors between PWS and PMTD when they speak, and hence a difference in muscular tension between the two groups.

Improving Speaker Enrolling Speed for Speaker Verification Systems Based on Multilayer Perceptrons by Using a Qualitative Background Speaker Selection (정질적 기준을 이용한 다층신경망 기반 화자증명 시스템의 등록속도 단축방법)

  • 이태승;황병원
    • The Journal of the Acoustical Society of Korea / v.22 no.5 / pp.360-366 / 2003
  • Although multilayer perceptrons (MLPs) have several advantages over other pattern recognition methods, MLP-based speaker verification systems suffer from slow enrollment caused by the many background speakers needed to achieve a low verification error. To solve this problem, the quantitative discriminative cohort speakers (QnDCS) method, by introducing the cohort speakers method into these systems, reduced the number of background speakers required to enroll speakers. Although the QnDCS achieved this goal to some extent, the improvement in enrolling speed was still unsatisfactory. To further improve the enrolling speed, this paper proposes the qualitative DCS (QlDCS) method, which introduces a qualitative criterion for selecting fewer background speakers. An experiment on both methods was conducted using a speaker verification system based on MLPs and continuants, together with a speech database. The results show that the proposed QlDCS method enrolls speakers in half the time required by the QnDCS under the online error backpropagation (EBP) method.

Evaluating Impressions of Robots According to the Robot's Embodiment Level and Response Speed (로봇의 외형 구체화 정도 및 반응속도에 따른 로봇 인상 평가)

  • Kang, Dahyun;Kwak, Sonya S.
    • Design Convergence Study / v.16 no.6 / pp.153-167 / 2017
  • Nowadays, as many robots are developed for the desktop, users interact with them through speech. However, due to the technical limitations of speech-based interaction, an alternative is needed. We designed this research to develop a robot that interacts with the user by using the unconditional reflection of biological signals. In order to apply bio-signals to robots more effectively, we evaluated overall service evaluation, perceived intelligence, appropriateness, trustworthiness, and sociability according to the robot's embodiment level and response speed. The results showed that in terms of intelligence and appropriateness, a 3D robot with a higher embodiment level was evaluated more positively than a 2D robot with a lower embodiment level. Also, the robot with the faster response rate was evaluated more favorably in overall service evaluation, intelligence, appropriateness, trustworthiness, and sociability than the robot with the slower response rate. In addition, in service evaluation, trustworthiness, and sociability, there were interaction effects between the robot's embodiment level and its response speed.

The Study on the Effects of Vocal Function Exercise for Trained Singers (성악인의 발성능력 향상에 Vocal Function Exercise가 미치는 영향)

  • Kwon, Young-Kyung;Sim, Hyun-Sub;Jin, Sung-Min;Chung, Sung-Min
    • Speech Sciences / v.10 no.2 / pp.169-189 / 2003
  • Trained singers, one group of professional voice users, take much more interest than ordinary people in the voice and its management. They train to sing beautifully and, at the same time, strive for efficient voice production. The present study was performed with three tenors and three baritones, undergraduate students majoring in classical singing, to investigate how much vocal function exercise improved the efficiency of their voice production, by measuring three dependent variables: maximum phonation time, speed quotient of glottal contact, and the number of semitones. To establish a baseline, the dependent variables were measured 3∼6 times over two weeks. The subjects then performed vocal function exercise for seven weeks, and after the training ended, evaluation was performed four times over two weeks to assess the maintenance of the training effect. Vocal function exercise is composed of four successive steps: warm-up, stretching exercise, contracting exercise, and power exercise. As a result, all six subjects showed improvement in maximum phonation time, speed quotient of glottal contact, and the number of semitones.
