• Title/Summary/Keyword: End-to-end Automatic Speech Recognition

Search Result 16, Processing Time 0.021 seconds

Development of Automatic Creating Web-Site Tool for the Blind (시각장애인용 웹사이트 자동생성 툴 개발)

  • Baek, Hyeun-Ki;Ha, Tai-Hyun
    • Journal of Digital Contents Society
    • /
    • v.8 no.4
    • /
    • pp.467-474
    • /
    • 2007
  • This paper documents the design and implementation of an automatic creating web-site tool for the blind to build their own homepage by using both voice recognition and voice mixed technology with equal ease as the non-disabled. The blind can make voice mails, schedules, address lists and bookmarks by making use of the tool. It also facilitates communication between the non-disabled with the help of their information management system. This tool converts basic commands into voice recognition, also making an offer of text-to-speech which supports voice output. In the end, the tool will remove the blind's social isolation, allowing them to enjoy the information age like the non-disabled.

  • PDF

Voice Activity Detection with Run-Ratio Parameter Derived from Runs Test Statistic

  • Oh, Kwang-Cheol
    • Speech Sciences
    • /
    • v.10 no.1
    • /
    • pp.95-105
    • /
    • 2003
  • This paper describes a new parameter for voice activity detection which serves as a front-end part for automatic speech recognition systems. The new parameter called run-ratio is derived from the runs test statistic which is used in the statistical test for randomness of a given sequence. The run-ratio parameter has the property that the values of the parameter for the random sequence are about 1. To apply the run-ratio parameter into the voice activity detection method, it is assumed that the samples of an inputted audio signal should be converted to binary sequences of positive and negative values. Then, the silence region in the audio signal can be regarded as random sequences so that their values of the run-ratio would be about 1. The run-ratio for the voiced region has far lower values than 1 and for fricative sounds higher values than 1. Therefore, the parameter can discriminate speech signals from the background sounds by using the newly derived run-ratio parameter. The proposed voice activity detector outperformed the conventional energy-based detector in the sense of error mean and variance, small deviation from true speech boundaries, and low chance of missing real utterances

  • PDF

Speech enhancement method based on feature compensation gain for effective speech recognition in noisy environments (잡음 환경에 효과적인 음성인식을 위한 특징 보상 이득 기반의 음성 향상 기법)

  • Bae, Ara;Kim, Wooil
    • The Journal of the Acoustical Society of Korea
    • /
    • v.38 no.1
    • /
    • pp.51-55
    • /
    • 2019
  • This paper proposes a speech enhancement method utilizing the feature compensation gain for robust speech recognition performances in noisy environments. In this paper we propose a speech enhancement method utilizing the feature compensation gain which is obtained from the PCGMM (Parallel Combined Gaussian Mixture Model)-based feature compensation method employing variational model composition. The experimental results show that the proposed method significantly outperforms the conventional front-end algorithms and our previous research over various background noise types and SNR (Signal to Noise Ratio) conditions in mismatched ASR (Automatic Speech Recognition) system condition. The computation complexity is significantly reduced by employing the noise model selection technique with maintaining the speech recognition performance at a similar level.

Voice Activity Detection Based on Signal Energy and Entropy-difference in Noisy Environments (엔트로피 차와 신호의 에너지에 기반한 잡음환경에서의 음성검출)

  • Ha, Dong-Gyung;Cho, Seok-Je;Jin, Gang-Gyoo;Shin, Ok-Keun
    • Journal of Advanced Marine Engineering and Technology
    • /
    • v.32 no.5
    • /
    • pp.768-774
    • /
    • 2008
  • In many areas of speech signal processing such as automatic speech recognition and packet based voice communication technique, VAD (voice activity detection) plays an important role in the performance of the overall system. In this paper, we present a new feature parameter for VAD which is the product of energy of the signal and the difference of two types of entropies. For this end, we first define a Mel filter-bank based entropy and calculate its difference from the conventional entropy in frequency domain. The difference is then multiplied by the spectral energy of the signal to yield the final feature parameter which we call PEED (product of energy and entropy difference). Through experiments. we could verify that the proposed VAD parameter is more efficient than the conventional spectral entropy based parameter in various SNRs and noisy environments.

A New Endpoint Detection Method Based on Chaotic System Features for Digital Isolated Word Recognition System (음성인식을 위한 혼돈시스템 특성기반의 종단탐색 기법)

  • Zang, Xian;Chong, Kil-To
    • Journal of the Institute of Electronics Engineers of Korea SC
    • /
    • v.46 no.5
    • /
    • pp.8-14
    • /
    • 2009
  • In the research field of speech recognition, pinpointing the endpoints of speech utterance even with the presence of background noise is of great importance. These noise present during recording introduce disturbances which complicates matters since what we just want is to get the stationary parameters corresponding to each speech section. One major cause of error in automatic recognition of isolated words is the inaccurate detection of the beginning and end boundaries of the test and reference templates, thus the necessity to find an effective method in removing the unnecessary regions of a speech signal. The conventional methods for speech endpoint detection are based on two linear time-domain measurements: the short-time energy, and short-time zero-crossing rate. They perform well for clean speech but their precision is not guaranteed if there is noise present, since the high energy and zero-crossing rate of the noise is mistaken as a part of the speech uttered. This paper proposes a novel approach in finding an apparent threshold between noise and speech based on Lyapunov Exponents (LEs). This proposed method adopts the nonlinear features to analyze the chaos characteristics of the speech signal instead of depending on the unreliable factor-energy. The excellent performance of this approach compared with the conventional methods lies in the fact that it detects the endpoints as a nonlinearity of speech signal, which we believe is an important characteristic and has been neglected by the conventional methods. The proposed method extracts the features based only on the time-domain waveform of the speech signal illustrating its low complexity. Simulations done showed the effective performance of the Proposed method in a noisy environment with an average recognition rate of up 92.85% for unspecified person.

Automatic Speech Style Recognition Through Sentence Sequencing for Speaker Recognition in Bilateral Dialogue Situations (양자 간 대화 상황에서의 화자인식을 위한 문장 시퀀싱 방법을 통한 자동 말투 인식)

  • Kang, Garam;Kwon, Ohbyung
    • Journal of Intelligence and Information Systems
    • /
    • v.27 no.2
    • /
    • pp.17-32
    • /
    • 2021
  • Speaker recognition is generally divided into speaker identification and speaker verification. Speaker recognition plays an important function in the automatic voice system, and the importance of speaker recognition technology is becoming more prominent as the recent development of portable devices, voice technology, and audio content fields continue to expand. Previous speaker recognition studies have been conducted with the goal of automatically determining who the speaker is based on voice files and improving accuracy. Speech is an important sociolinguistic subject, and it contains very useful information that reveals the speaker's attitude, conversation intention, and personality, and this can be an important clue to speaker recognition. The final ending used in the speaker's speech determines the type of sentence or has functions and information such as the speaker's intention, psychological attitude, or relationship to the listener. The use of the terminating ending has various probabilities depending on the characteristics of the speaker, so the type and distribution of the terminating ending of a specific unidentified speaker will be helpful in recognizing the speaker. However, there have been few studies that considered speech in the existing text-based speaker recognition, and if speech information is added to the speech signal-based speaker recognition technique, the accuracy of speaker recognition can be further improved. Hence, the purpose of this paper is to propose a novel method using speech style expressed as a sentence-final ending to improve the accuracy of Korean speaker recognition. To this end, a method called sentence sequencing that generates vector values by using the type and frequency of the sentence-final ending appearing in the utterance of a specific person is proposed. To evaluate the performance of the proposed method, learning and performance evaluation were conducted with a actual drama script. The method proposed in this study can be used as a means to improve the performance of Korean speech recognition service.