• Title/Abstract/Keywords: speech speed

End-to-end 비자기회귀식 가속 음성합성기 (End-to-end non-autoregressive fast text-to-speech)

  • 김위백;남호성
    • 말소리와 음성과학 / Vol. 13, No. 4 / pp.47-53 / 2021
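  • Autoregressive TTS models suffer from two inherent problems: instability and slow output. The instability problem is that when the model mispredicts the data at time step t, all subsequent data are mispredicted as well. The slowdown in speech output arises from the requirement that the predictions for time steps 1 through t-1 must precede the prediction for time step t. As an alternative to the problems caused by autoregression, this study proposes an end-to-end non-autoregressive fast TTS model. The proposed model achieved a MOS close to that of the Tacotron 2 - WaveNet model, along with higher stability and output speed. Building on the proposed model, this study aims to offer implications for improving non-autoregressive TTS models.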

스크린리더를 사용하는 시각장애인의 한국어 합성음 청취속도 연구 (A Study of Korean TTS Listening Speed for the Blind Using a Screen Reader)

  • 이희연;홍기형
    • 말소리와 음성과학 / Vol. 5, No. 3 / pp.63-69 / 2013
  • The purpose of this study was to evaluate the maximum and optimal listening speed of Korean TTS for the blind. Five blind participants took part in this study. The instruments used in this study were 17 sentence sets (2 sets for an exercise, 10 sets for a repeated test, and 5 sets for a random test), with short meaningful sentences (the same sentences for the repeated test, different sentences for the random test) with 15 differentiated speeds (Range=0.8-3.6, SD=0.2). Each participant's maximum and quickest listening speeds were calculated by objective recall accuracy (determined by the number of correctly recalled syllables / the total number of syllables in a sentence × 100) and subjective recall accuracy (recall accuracy judged by each participant's subjective evaluation). The results showed that the participants' recall accuracy had a tendency to increase as the TTS speed decreased. Participants' subjective recall accuracy was higher than objective recall accuracy in the repeated tests and vice versa in the random tests. The results also revealed that the participants' sentence familiarity had an influence on their Korean TTS listening speed.
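
The objective recall accuracy defined in this abstract (correctly recalled syllables divided by total syllables, times 100) can be sketched as follows. The abstract does not say how recalled syllables were matched against the reference sentence, so the in-order alignment via `difflib.SequenceMatcher` and the helper names `hangul_syllables` and `objective_recall_accuracy` are assumptions made for illustration only, not the authors' scoring procedure.

```python
from difflib import SequenceMatcher


def hangul_syllables(text: str) -> list:
    """Keep only precomposed Hangul syllable blocks (U+AC00..U+D7A3).

    Each such character is one written syllable, so the list length
    is the syllable count of the sentence.
    """
    return [ch for ch in text if "\uac00" <= ch <= "\ud7a3"]


def objective_recall_accuracy(reference: str, recalled: str) -> float:
    """Correctly recalled syllables / total reference syllables * 100.

    ASSUMPTION: "correctly recalled" is approximated by syllables that
    match the reference in order, as found by SequenceMatcher.
    """
    ref = hangul_syllables(reference)
    rec = hangul_syllables(recalled)
    if not ref:
        raise ValueError("reference sentence contains no Hangul syllables")
    matcher = SequenceMatcher(None, ref, rec, autojunk=False)
    correct = sum(block.size for block in matcher.get_matching_blocks())
    return correct / len(ref) * 100


if __name__ == "__main__":
    # One of the seven reference syllables is missing from the recall.
    print(objective_recall_accuracy("나는 학교에 간다", "나는 학교 간다"))
```

A stricter scorer could require position-exact matches; the in-order alignment above tolerates a dropped syllable without penalizing everything after it.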

자동차 주행 환경에서의 음성 전달 명료도와 음성 인식 성능 비교 (Comparison of Speech Intelligibility & Performance of Speech Recognition in Real Driving Environments)

  • 이광현;최대림;김영일;김봉완;이용주
    • 대한음성학회지:말소리 / No. 50 / pp.99-110 / 2004
  • The normal transmission characteristics of sound are hardly obtained in a running car because of various noises and structural factors. The resulting channel distortion of the original source sound recorded by microphones seriously degrades the performance of speech recognition in real driving environments. In this paper we analyze the degree of intelligibility under various channel-dependent sound distortion conditions according to driving speed, using the speech transmission index (STI), and compare the STI with speech recognition rates. We examine the correlation between measures of intelligibility depending on sound pick-up patterns and speech recognition performance, and from this we consider the optimal location of a microphone in a single-channel environment. In our experiments, we find a high correlation between STI and speech recognition rates.

말속도가 인공와우 청각장애인의 문장지각에 미치는 영향 (Effects of Speech Rate on the Sentence Perception of Adults with Cochlear Implantation)

  • 신수진;신지철;윤미선;김덕용
    • 음성과학 / Vol. 13, No. 2 / pp.47-58 / 2006
  • People tend to control their speech rate to help those with listening problems, such as hearing-impaired people. The aim of this study was to investigate the effects of speech rate on sentence perception in 10 adults with cochlear implants. The sample speech included 42 sentences at normal, slow, and very slow speed, focusing on the overall duration, vowel duration, or pause duration. The subjects listened to the speech and wrote down what they heard. Each correct syllable of the content words in the sentence was counted to obtain the score. Partial points were given to incomplete syllables. The results of this study were as follows: 1. Changes of speech rate had some influence on the sentence perception score of the cochlear implant users. 2. In the slow pause condition, the controlled speech rate had a positive effect on the perception score.

PASS: A Parallel Speech Understanding System

  • Chung, Sang-Hwa
    • Journal of Electrical Engineering and Information Science / Vol. 1, No. 1 / pp.1-9 / 1996
  • A key issue in spoken language processing has become the integration of speech understanding and natural language processing (NLP). This paper presents a parallel computational model for the integration of speech and NLP. The model adopts a hierarchically structured knowledge base and memory-based parsing techniques. Processing is carried out by passing multiple markers in parallel through the knowledge base. Speech-specific problems such as insertion, deletion, and substitution have been analyzed and their parallel solutions are provided. The complete system has been implemented on the Semantic Network Array Processor (SNAP) and is operational. Results show an 80% sentence recognition rate for the Air Traffic Control domain. Moreover, a 15-fold speed-up can be obtained over an identical sequential implementation, with an increasing speed advantage as the size of the knowledge base grows.

한국 표준어 연속음성에서의 억양구와 강세구 자동 검출 (Automatic Detection of Intonational and Accentual Phrases in Korean Standard Continuous Speech)

  • 이기영;송민석
    • 음성과학 / Vol. 7, No. 2 / pp.209-224 / 2000
  • This paper proposes an automatic detection method for intonational and accentual phrases in Korean standard continuous speech. We use pauses over 150 msec to detect intonational phrases, and extract accentual phrases from the intonational phrases by analyzing syllables and pitch contours. The speech data for the experiment consist of seven male voices and two female voices reading the text of the fable 'the ant and the grasshopper' and a newspaper article 'manmulsang' at normal speed and in the Korean standard variation. The results of the experiment show that the detection rate of intonational phrases is 95% on average and that of accentual phrases is 73%. This detection rate implies that we can segment continuous speech into smaller units (i.e. prosodic phrases) by using prosodic information, so that the objects of speech recognition can be narrowed down to words or phrases in continuous speech.
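
The pause rule in this abstract (a pause over 150 msec marks an intonational phrase boundary) can be illustrated with a minimal sketch. It assumes the signal has already been reduced to per-frame speech/silence decisions; the 10 ms frame length, the function name `split_on_pauses`, and the boolean input are illustrative assumptions, and the accentual-phrase step based on syllables and pitch contours is not covered.

```python
from typing import List, Tuple


def split_on_pauses(
    is_speech: List[bool],
    frame_ms: float = 10.0,      # ASSUMPTION: frame length, not from the abstract
    min_pause_ms: float = 150.0, # pause threshold stated in the abstract
) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, of stretches of
    speech separated by silences longer than min_pause_ms."""
    segments = []
    seg_start = None    # first frame of the phrase being built
    last_speech = None  # most recent frame marked as speech
    for i, speech in enumerate(is_speech):
        if not speech:
            continue
        if seg_start is None:
            seg_start = i
        elif (i - last_speech - 1) * frame_ms > min_pause_ms:
            # The silence since the last speech frame is a long pause:
            # close the current phrase and start a new one here.
            segments.append((seg_start, last_speech + 1))
            seg_start = i
        last_speech = i
    if seg_start is not None:
        segments.append((seg_start, last_speech + 1))
    return segments


if __name__ == "__main__":
    # 50 ms speech, 200 ms pause, 50 ms speech, 100 ms pause, 50 ms speech
    frames = [True] * 5 + [False] * 20 + [True] * 5 + [False] * 10 + [True] * 5
    print(split_on_pauses(frames))
```

Short silences (here the 100 ms gap) stay inside a phrase, which is why the threshold is applied to the gap between speech frames and not to every silent frame.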

Variable LPF에 의한 피치검출 (The Pitch Detection Using Variable LPF)

  • 백금란
    • 한국음향학회:학술대회논문집 / 한국음향학회 1993 Conference Proceedings, Vol. 12, No. 1 / pp.88-92 / 1993
  • In speech signal processing, it is necessary to detect the pitch exactly. The pitch extraction algorithms proposed so far have difficulty detecting pitch over a wide range of speech signals. We therefore propose a new algorithm that uses G-peak extraction for this purpose. The method finds the MZI (maximum zero-crossing interval) in each frame and convolves it with the speech signal; this is equivalent to passing the speech signal through a variable LPF. As a result, we obtain the pitch, improve the accuracy of pitch detection, and extract it at high speed.

정서음성 합성을 위한 예비연구 (Preliminary Study on Synthesis of Emotional Speech)

  • 한영호;이서배;이정철;김형순
    • 대한음성학회:학술대회논문집 / 대한음성학회 October 2003 Conference Proceedings / pp.181-184 / 2003
  • This paper explores the perceptual relevance of acoustical correlates of emotional speech using a formant synthesizer. The focus is on the role of mean pitch, pitch range, speed rate, and phonation type in synthesizing emotional speech. The results support the traditional impressionistic observations. However, they suggest that some phonation types should be synthesized with further refinement.

음질 및 속도 향상을 위한 선형 스펙트로그램 활용 Text-to-speech (Text-to-speech with linear spectrogram prediction for quality and speed improvement)

  • 윤혜빈
    • 말소리와 음성과학 / Vol. 13, No. 3 / pp.71-78 / 2021
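  • Most neural-network-based speech synthesis models use a vocoder model to generate natural, high-quality speech. The vocoder model is combined with a mel-spectrogram prediction model and converts the mel spectrogram into speech. However, using a vocoder model requires large amounts of computer memory and training time, and speech synthesis takes a long time in real service environments where no GPU is available. Existing linear-spectrogram prediction models do not use a vocoder model and so avoid this problem, but they fail to generate high-quality speech. This paper presents a linear-spectrogram prediction model based on Tacotron 2 & Transformer that generates good-quality speech without using a neural vocoder. In experiments measuring the performance and speed of the model, it was slightly superior to vocoder-based models in both performance and speed, and it is therefore expected to serve as a stepping stone for research on speech synthesis models that generate high-quality speech quickly.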

TMS320C2000계열 DSP를 이용한 단일칩 음성인식기 구현 (Implementation of a Single-chip Speech Recognizer Using the TMS320C2000 DSPs)

  • 정익주
    • 음성과학 / Vol. 14, No. 4 / pp.157-167 / 2007
  • In this paper, we implemented a single-chip speech recognizer using the TMS320C2000 DSPs. For this implementation, we developed a very small speaker-dependent recognition engine based on dynamic time warping, which is especially suited to embedded systems where system resources are severely limited. We carried out several optimizations, including speed optimization by programming time-critical functions in assembly language, code size optimization, and effective memory allocation. On the TMS320F2801 DSP, which has 12Kbyte SRAM and 32Kbyte flash ROM, the recognizer can recognize 10 commands. On the TMS320F2808 DSP, which has 36Kbyte SRAM and 128Kbyte flash ROM, it has the additional capability of outputting the speech sound corresponding to the recognition result. The speech sounds for the response, which are captured when the user trains commands, are encoded using ADPCM and saved in flash ROM. The single-chip recognizer needs few parts other than the DSP itself and an op amp for amplifying the microphone output and anti-aliasing. Therefore, this recognizer may play a role similar to that of dedicated speech recognition chips.
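
The abstract names dynamic time warping as the basis of the speaker-dependent engine. The following is a minimal textbook sketch of DTW template matching, not the paper's implementation: the feature type, the Euclidean local distance, the absence of path constraints, and the names `dtw_distance` and `recognize` are all assumptions for illustration, and the paper's fixed-point and assembly-level optimizations are not reflected.

```python
import math
from typing import Dict, Sequence

Frames = Sequence[Sequence[float]]  # one feature vector per frame


def dtw_distance(a: Frames, b: Frames) -> float:
    """Accumulated cost of the best monotonic alignment of a and b."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b frame is repeated
                cost[i][j - 1],      # b advances, a frame is repeated
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def recognize(utterance: Frames, templates: Dict[str, Frames]) -> str:
    """Return the command whose trained template is closest under DTW."""
    return min(templates, key=lambda name: dtw_distance(utterance, templates[name]))


if __name__ == "__main__":
    # Toy one-dimensional "features"; real systems would use spectral features.
    templates = {
        "on": [[0.0], [1.0], [2.0]],
        "off": [[2.0], [1.0], [0.0]],
    }
    print(recognize([[0.0], [1.0], [1.0], [2.0]], templates))
```

Because the user records the templates, a speaker-dependent recognizer of this kind needs no acoustic model, which fits the small memory budgets the abstract describes.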
