• Title/Abstract/Keywords: automatic speech recognition

212 search results; processing time 0.022 s

KMSAV: Korean multi-speaker spontaneous audiovisual dataset

  • Kiyoung Park;Changhan Oh;Sunghee Dong
    • ETRI Journal / Vol. 46, No. 1 / pp.71-81 / 2024
  • Recent advances in deep learning for speech and visual recognition have accelerated the development of multimodal speech recognition, yielding many innovative results. We introduce a Korean audiovisual speech recognition corpus. This dataset comprises approximately 150 h of manually transcribed and annotated audiovisual data, supplemented with an additional 2000 h of untranscribed videos collected from YouTube under the Creative Commons License. The dataset is intended to be freely accessible for unrestricted research purposes. Along with the corpus, we propose an open-source framework for automatic speech recognition (ASR) and audiovisual speech recognition (AVSR). We validate the effectiveness of the corpus with evaluations using state-of-the-art ASR and AVSR techniques, capitalizing on both pretrained models and fine-tuning processes. After fine-tuning, ASR and AVSR achieve character error rates of 11.1% and 18.9%, respectively. This error difference highlights the need for improvement in AVSR techniques. We expect that our corpus will be an instrumental resource to support improvements in AVSR.
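
The character error rates quoted in this abstract are conventionally computed as the edit distance between the reference and the recognized character sequences, divided by the reference length. The sketch below illustrates that general definition only; the function names are hypothetical, and removing spaces before scoring is an assumption, not something the abstract specifies.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: substitutions, insertions, and deletions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance / number of reference characters.

    Assumption: spaces are ignored when scoring.
    """
    ref = list(reference.replace(" ", ""))
    hyp = list(hypothesis.replace(" ", ""))
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)
```

For example, a four-character reference with one substituted character gives a CER of 0.25.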

Modified Phonetic Decision Tree For Continuous Speech Recognition

  • Kim, Sung-Ill;Kitazoe, Tetsuro;Chung, Hyun-Yeol
    • The Journal of the Acoustical Society of Korea / Vol. 17, No. 4E / pp.11-16 / 1998
  • For large-vocabulary speech recognition using HMMs, context-dependent subword units have often been employed. However, context-dependent phone models result in a system with too many parameters to train. The problem of too many parameters and too little training data is crucial in the design of a statistical speech recognizer. Furthermore, when building large-vocabulary speech recognition systems, the unseen-triphone problem is unavoidable. In this paper, we propose a modified phonetic decision tree algorithm for the automatic prediction of unseen triphones, and we show its advantages in addressing these problems through two experiments in Japanese contexts. The baseline experimental results show that the modified tree-based clustering algorithm is effective for clustering and for reducing the number of states without any degradation in performance. The task experimental results show that the proposed algorithm also has the advantage of providing automatic prediction of unseen triphones.

효과적인 2차 최적화 적용을 위한 Minibatch 단위 DNN 훈련 관점에서의 CNN 구현 (Implementation of CNN in the view of mini-batch DNN training for efficient second order optimization)

  • 송화전;정호영;박전규
    • 말소리와 음성과학 / Vol. 8, No. 2 / pp.23-30 / 2016
  • This paper describes implementation schemes for CNNs from the viewpoint of mini-batch DNN training, with the aim of efficient second-order optimization. The CNN parameters are trained with the same parameter-update procedure used for a DNN, simply by arranging an input image as a sequence of local patches, which is equivalent to mini-batch DNN training. Through this conversion, second-order optimization, which provides higher performance, can be readily applied to train the CNN parameters. In both image recognition on the MNIST DB and syllable automatic speech recognition, the proposed CNN implementation scheme performs better than one based on a DNN.

A User friendly Remote Speech Input Unit in Spontaneous Speech Translation System

  • 이광석;김흥준;송진국;추연규
    • 한국정보통신학회:학술대회논문집 / 한국해양정보통신학회 2008년도 춘계종합학술대회 A / pp.784-788 / 2008
  • In this research, we propose a remote speech input unit, a new user-friendly method of speech input for speech recognition systems. We define user-friendliness in terms of hands-free operation and microphone independence in speech recognition applications. Our module adopts two algorithms: automatic speech detection and speech enhancement based on microphone-array beamforming. In the performance evaluation of speech detection, the within-200 ms accuracy with respect to manually detected positions is about 97% in noisy environments with an SNR of 25 dB. The microphone-array speech enhancement using the delay-and-sum beamforming algorithm shows a maximum SNR gain of about 6 dB over a single microphone and an error reduction rate of more than 12% in speech recognition.
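
Delay-and-sum beamforming, named in this abstract, aligns the microphone channels by compensating each channel's arrival delay and then averages them, so the target speech adds coherently while uncorrelated noise partly cancels. The sketch below is a minimal illustration under simplifying assumptions (integer sample delays that are already known, equal-length channels); the function name and interface are hypothetical and do not reproduce the authors' implementation.

```python
def delay_and_sum(channels, delays):
    """Average the channels after advancing each one by its arrival delay.

    channels: list of equal-length lists of samples, one per microphone.
    delays:   integer arrival delay (in samples) of each channel relative
              to the reference microphone (assumed known here).
    """
    n = len(channels[0])
    out = [0.0] * n
    for signal, delay in zip(channels, delays):
        for t in range(n):
            src = t + delay  # read ahead to undo this channel's delay
            if 0 <= src < n:
                out[t] += signal[src]
    return [value / len(channels) for value in out]
```

In practice the delays are estimated from the array geometry or from cross-correlation between channels, and fractional delays require interpolation; both are outside this sketch.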

과학수사를 위한 한국인 음성 특화 자동화자식별시스템 (Forensic Automatic Speaker Identification System for Korean Speakers)

  • 김경화;소병민;유하진
    • 말소리와 음성과학 / Vol. 4, No. 3 / pp.95-101 / 2012
  • In this paper, we introduce the automatic speaker identification system 'SPO (Supreme Prosecutors Office) Verifier'. SPO Verifier is an automatic speaker recognition system based on a GMM (Gaussian mixture model)-UBM (universal background model) and was developed using Korean speakers' utterances. The system uses a channel compensation algorithm to compensate for recording device characteristics. It also lets users manage reference models with utterances from various environments to obtain more accurate recognition results. To evaluate the performance of SPO Verifier on Korean speakers, we compared it with one of the most widely used commercial systems in the forensic field. The results showed that SPO Verifier achieves a lower EER (equal error rate) than the commercial system.
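
The EER reported in this abstract is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below approximates it by sweeping a decision threshold over the observed scores; the function name and the convention that a trial is accepted when its score is at least the threshold are assumptions made for illustration.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER by trying every observed score as a threshold.

    A trial is accepted when its score is >= the threshold (assumption).
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best_gap, eer = None, None
    for threshold in thresholds:
        # False rejection: a genuine trial scored below the threshold.
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        # False acceptance: an impostor trial scored at or above the threshold.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

With finite score lists the two rates rarely cross exactly, which is why the sketch reports the midpoint at the closest threshold.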

자동 입력레벨 조절기의 구현 및 인식 성능 향상 (Implementation of Automatic Microphone Volume Controller and Recognition Rate Improvement)

  • 김상진;한민수
    • 대한전자공학회:학술대회논문집 / 대한전자공학회 2001년도 제14회 신호처리 합동 학술대회 논문집 / pp.503-506 / 2001
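  • This paper addresses the implementation of a microphone input level controller and the resulting improvement in recognition rate. If the speech input through the microphone is too quiet or too loud, it directly affects the recognition rate, so the input must be adjusted to a level suitable for recognition. We studied the considerations for implementing an automatic input level controller and, based on them, implemented an input level controller for the PC environment. Channel distortion in the collected speech database was compensated using cepstral mean subtraction (CMS), and experiments with the implemented controller showed that the misrecognition rate was reduced by about 50% compared with not using it.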
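
Cepstral mean subtraction (CMS), mentioned in this abstract, relies on the fact that a stationary channel appears as a constant additive offset in the cepstral domain, so subtracting the per-coefficient mean over an utterance removes it. The sketch below shows that general idea only; the function name and the list-of-vectors input format are assumptions.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the utterance-level mean from every cepstral coefficient.

    frames: list of equal-length cepstral vectors, one per analysis frame.
    """
    if not frames:
        return []
    count = len(frames)
    dim = len(frames[0])
    # Mean of each coefficient across all frames of the utterance.
    means = [sum(frame[d] for frame in frames) / count for d in range(dim)]
    return [[frame[d] - means[d] for d in range(dim)] for frame in frames]
```

Adding the same constant vector to every frame (a simulated channel offset) leaves the output unchanged, which is the property CMS relies on.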

한국어 공통 음성 DB구축 및 오류 검증 (Common Speech Database Collection and Validation for Communications)

  • 이수종;김상훈;이영직
    • 대한음성학회지:말소리 / No. 46 / pp.145-157 / 2003
  • In this paper, we briefly introduce the Korean common speech database, a project started in 2002 to construct a large-scale speech database. The project aims to support the R&D environment for speech technology in industry, to encourage the domestic speech industry, and to stimulate the domestic speech technology market. In the first year, the resulting common speech database consists of 25 kinds of databases covering various recording conditions such as telephone, PC, and VoIP. The speech database will be widely used for speech recognition, speech synthesis, and speaker identification. Although the database was originally corrected manually, it still retains unknown errors and human errors. To minimize the errors in the database, we tried to find errors based on recognition errors and classified them into several kinds. To be more effective than typical recognition techniques, we will develop an automatic error detection method. In the future, we will try to construct new databases reflecting the needs of companies and universities.

발음열 자동 변환을 이용한 한국어 음운 변화 규칙의 통계적 분석 (Statistical Analysis of Korean Phonological Rules Using a Automatic Phonetic Transcription)

  • 이경님;정민화
    • 대한음성학회:학술대회논문집 / 대한음성학회 2002년도 11월 학술대회지 / pp.81-85 / 2002
  • We present a statistical analysis of Korean phonological variations using automatic generation of phonetic transcriptions. We have constructed a system that automatically generates Korean pronunciation variants by applying rules that model obligatory and optional phonemic changes and allophonic changes. These rules are derived from knowledge-based morphophonological analysis and the government's standard pronunciation rules. The system is optimized for continuous speech recognition: it generates phonetic transcriptions for training and constructs a pronunciation dictionary for recognition. In this paper, we describe Korean phonological variations by analyzing the statistics of phonemic change rule applications for the 60,000 sentences in the Samsung PBS (Phonetic Balanced Sentence) Speech DB. Our results show that the most frequent obligatory phonemic variations are, in order, liaison, tensification, aspiration, and nasalization of obstruents, and that the most frequent optional phonemic variations are, in order, initial consonant h-deletion, insertion of a final consonant with the same place of articulation as the next consonant, and deletion of a final consonant with the same place of articulation as the next consonant. These statistics can be used to improve the performance of speech recognition systems.

다양한 음성을 이용한 자동화자식별 시스템 성능 확인에 관한 연구 (Variation of the Verification Error Rate of Automatic Speaker Recognition System With Voice Conditions)

  • 홍수기
    • 대한음성학회지:말소리 / No. 43 / pp.45-55 / 2002
  • High reliability of automatic speaker recognition regardless of voice conditions is necessary for forensic applications. Audio recordings in real cases are not consistent in voice conditions such as duration, time interval of recording, given text or conversational speech, and transmission channel. In this study, we investigated how the verification error rate of an automatic speaker recognition (ASR) system varies with the voice conditions. The results indicate that, to decrease both the false rejection rate and the false acceptance rate, a variety of voices should be used for training, and the training voices should be longer in duration than the test voices.

잡음환경에서의 음성인식을 위한 변이특성을 고려한 파라메터 (Parameter Considering Variance Property for Speech Recognition in Noisy Environment)

  • 박진영;이광석;고시영;허강인
    • 한국정보통신학회:학술대회논문집 / 한국해양정보통신학회 2005년도 추계종합학술대회 / pp.469-472 / 2005
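  • In this paper, we propose effective speech feature parameters that are robust to noise for implementing a speech recognition system. MFCC, the most basic parameter used in ASR (Automatic Speech Recognition), and DCTCs obtained using the DCT were set as the baseline parameters. In addition, we propose delta-cepstrum and delta-delta-cepstrum parameters, which reconstruct the cepstrum so that it carries information about the transitional segments of speech, and compare their recognition performance using HMMs. We then apply the LDA algorithm to reduce the dimensionality of each parameter and compare the resulting recognition performance. The experiments show that, in noisy environments under various conditions, the delta-delta-cepstrum parameter reduced in dimension by LDA achieves better recognition performance than the conventional parameters.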
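
Delta and delta-delta cepstra capture how the cepstrum changes over time. The sketch below uses the widely used regression formula for delta coefficients with edge frames repeated as padding; this is the conventional definition and is only an assumption here, since the abstract does not state how the authors reconstruct the cepstrum. The function name and the `window` parameter are hypothetical.

```python
def delta(frames, window=2):
    """Regression-based delta coefficients over +/- `window` frames.

    frames: list of equal-length cepstral vectors, one per analysis frame.
    Frames beyond the utterance edges are replaced by the nearest edge frame.
    """
    count = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(k * k for k in range(1, window + 1))
    result = []
    for t in range(count):
        row = []
        for d in range(dim):
            num = 0.0
            for k in range(1, window + 1):
                nxt = frames[min(t + k, count - 1)][d]
                prv = frames[max(t - k, 0)][d]
                num += k * (nxt - prv)
            row.append(num / denom)
        result.append(row)
    return result
```

Under the same assumption, delta-delta coefficients are obtained by applying the function twice, as in `delta(delta(frames))`.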
