• Title/Summary/Keyword: Speaker recognition systems


End-to-end speech recognition models using limited training data (제한된 학습 데이터를 사용하는 End-to-End 음성 인식 모델)

  • Kim, June-Woo; Jung, Ho-Young
    • Phonetics and Speech Sciences / v.12 no.4 / pp.63-71 / 2020
  • Speech recognition is one of the areas being actively commercialized using deep learning and machine learning techniques. However, the majority of speech recognition systems on the market are developed on data with limited speaker diversity and tend to perform well only on typical adult speakers. This is because most speech recognition models are trained on speech databases collected from adult males and females, which causes problems in recognizing the speech of the elderly, children, and speakers with dialects. Solving these problems may require retaining a large database or collecting data for speaker adaptation. Instead, this paper proposes a new end-to-end speech recognition method that consists of an acoustic augmented recurrent encoder and a transformer decoder with linguistic prediction. The proposed method can provide reliable performance for both the acoustic and language models under limited data conditions. It was evaluated on recognizing Korean elderly and children's speech with a limited amount of training data and showed better performance compared to a conventional method.
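
The abstract names the architecture family but not its configuration. Below is a minimal PyTorch sketch of a recurrent acoustic encoder feeding a transformer decoder; the layer sizes, vocabulary size, and the omission of the paper's acoustic augmentation and linguistic-prediction components are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RecurrentEncoderTransformerDecoder(nn.Module):
    def __init__(self, n_mels=80, d_model=256, vocab_size=2000,
                 enc_layers=3, dec_layers=2, nhead=4):
        super().__init__()
        # Bidirectional GRU acoustic encoder; its outputs are the "memory"
        # that the transformer decoder attends to.
        self.encoder = nn.GRU(n_mels, d_model // 2, num_layers=enc_layers,
                              batch_first=True, bidirectional=True)
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=dec_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, feats, tokens):
        # feats: (batch, time, n_mels); tokens: (batch, tgt_len) label ids
        memory, _ = self.encoder(feats)
        tgt = self.embed(tokens)
        # Causal mask: each output position may attend only to earlier tokens.
        L = tokens.size(1)
        causal = torch.triu(torch.full((L, L), float('-inf')), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        return self.out(hidden)  # (batch, tgt_len, vocab_size) logits

model = RecurrentEncoderTransformerDecoder()
logits = model(torch.randn(2, 120, 80), torch.randint(0, 2000, (2, 10)))
print(logits.shape)  # torch.Size([2, 10, 2000])
```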

Robust Feature Extraction Based on Image-based Approach for Visual Speech Recognition (시각 음성인식을 위한 영상 기반 접근방법에 기반한 강인한 시각 특징 파라미터의 추출 방법)

  • Gyu, Song-Min; Pham, Thanh Trung; Min, So-Hee; Kim, Jing-Young; Na, Seung-You; Hwang, Sung-Taek
    • Journal of the Korean Institute of Intelligent Systems / v.20 no.3 / pp.348-355 / 2010
  • Despite advances in speech recognition technology, speech recognition in noisy environments is still a difficult task. To solve this problem, researchers have proposed methods that use visual information in addition to audio information. However, visual information contains visual noise just as audio information contains acoustic noise, and this visual noise degrades visual speech recognition. How to extract visual feature parameters that enhance recognition performance is therefore a field of active interest. In this paper, we propose a visual feature parameter extraction method based on an image-based approach for enhancing the recognition performance of an HMM-based visual speech recognizer. For the experiments, we constructed an audio-visual database consisting of 105 speakers, each of whom uttered 62 words. We applied histogram matching, lip folding, RASTA filtering, a linear mask, DCT, and PCA. The experimental results show that the recognition performance of our proposed method improved by about 21% over the baseline method.
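
As an illustration of the image-based approach, the sketch below computes low-frequency 2-D DCT coefficients from a normalized mouth ROI and reduces them with PCA, two of the steps the abstract lists. The ROI size, coefficient count, and PCA dimensionality are assumptions, and the histogram matching, lip folding, RASTA filtering, and linear-mask stages are omitted for brevity.

```python
import cv2
import numpy as np
from sklearn.decomposition import PCA

def dct_features(mouth_roi_gray, keep=8):
    """Low-frequency 2-D DCT coefficients of a normalized mouth ROI."""
    roi = cv2.resize(mouth_roi_gray, (64, 64))
    roi = cv2.equalizeHist(roi)                # crude illumination normalization
    coeffs = cv2.dct(roi.astype(np.float32))   # 2-D DCT of the ROI
    return coeffs[:keep, :keep].flatten()      # keep*keep low-frequency features

# Per-frame DCT features for a (dummy) frame sequence, then PCA across frames.
frames = [np.random.randint(0, 256, (90, 120), dtype=np.uint8) for _ in range(50)]
feats = np.stack([dct_features(f) for f in frames])   # shape (50, 64)
pca = PCA(n_components=20)
visual_params = pca.fit_transform(feats)              # (50, 20), input to the HMM
print(visual_params.shape)
```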

GMM-Based Maghreb Dialect Identification System

  • Nour-Eddine, Lachachi; Abdelkader, Adla
    • Journal of Information Processing Systems / v.11 no.1 / pp.22-38 / 2015
  • While Modern Standard Arabic is the formal spoken and written language of the Arab world, dialects are the major communication mode for everyday life. Therefore, identifying a speaker's dialect is critical in the Arabic-speaking world for speech processing tasks such as automatic speech recognition or identification. In this paper, we examine two approaches that reduce the Universal Background Model (UBM) in an automatic dialect identification system across the following five Arabic Maghreb dialects: Moroccan, Tunisian, and three dialects of the western (Oranian), central (Algiersian), and eastern (Constantinian) regions of Algeria. We applied our approaches to the Maghreb dialect detection domain, which contains a collection of 10-second utterances, and compared the identification precision obtained on the dialect samples by a baseline GMM-UBM system with that of our own improved GMM-UBM system, which uses a Reduced UBM algorithm. Our experiments show that our approaches significantly improve identification performance over purely acoustic features, with an identification rate of 80.49%.
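
A minimal sketch of the baseline GMM-UBM scheme the paper builds on: fit a Universal Background Model on pooled acoustic features, derive one per-dialect model from it, and pick the dialect with the highest average log-likelihood. scikit-learn has no MAP adaptation, so warm-started EM stands in for it here, and the paper's UBM-reduction algorithms are not shown; feature extraction (e.g., MFCCs) is out of scope.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_ubm(pooled_feats, n_comp=64):
    """Universal Background Model over features pooled from all dialects."""
    return GaussianMixture(n_comp, covariance_type='diag').fit(pooled_feats)

def adapt(ubm, dialect_feats, n_iter=2):
    """Per-dialect GMM warm-started from the UBM (stand-in for MAP adaptation)."""
    gmm = GaussianMixture(ubm.n_components, covariance_type='diag',
                          weights_init=ubm.weights_, means_init=ubm.means_,
                          precisions_init=ubm.precisions_, max_iter=n_iter)
    return gmm.fit(dialect_feats)

def identify(models, utterance_feats):
    # models: dict mapping dialect name -> adapted GaussianMixture
    scores = {d: m.score(utterance_feats) for d, m in models.items()}
    return max(scores, key=scores.get)

rng = np.random.default_rng(0)
ubm = fit_ubm(rng.normal(size=(5000, 13)))            # dummy 13-dim features
models = {d: adapt(ubm, rng.normal(loc=i, size=(1000, 13)))
          for i, d in enumerate(['Moroccan', 'Tunisian', 'Oranian'])}
print(identify(models, rng.normal(loc=1, size=(300, 13))))
```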

Development of Half-Mirror Interface System and Its Application for Ubiquitous Environment (유비쿼터스 환경을 위한 하프미러형 인터페이스 시스템 개발과 응용)

  • Kwon Young-Joon; Kim Dae-Jin; Lee Sang-Wan; Bien Zeungnam
    • Journal of Institute of Control, Robotics and Systems / v.11 no.12 / pp.1020-1026 / 2005
  • In the era of ubiquitous computing, human-friendly man-machine interfaces are attracting more attention due to their potential to offer convenient services. In this paper, we introduce the 'Half-Mirror Interface System (HMIS)' as a novel type of human-friendly man-machine interface. HMIS consists of a half-mirror, a USB webcam, a microphone, a 2-channel speaker, and a high-speed processing unit. In our HMIS, one of two principal operation modes is selected depending on whether a user is present in front of it. The first, 'mirror-mode', is activated when the user's face is detected via the USB webcam. In this mode, HMIS provides three basic functions: 1) make-up assistance, which magnifies a facial component of interest and gives TTS (Text-To-Speech) guidance for appropriate make-up; 2) a daily weather information service via the WWW; and 3) a health monitoring/diagnosis service based on Chinese medicine knowledge. The second, 'display-mode', is designed to show decorative pictures, family photos, art paintings, and so on. This mode is activated when the user's face has not been detected for some time. In display-mode, we also added a 'healing-window' function and a 'healing-music player' function for the user's psychological comfort and relaxation. All these functions are accessible through a commercially available voice synthesis/recognition package.
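
The mode-switching logic described above can be sketched with off-the-shelf face detection. The loop below flips between 'mirror-mode' and 'display-mode' based on whether an OpenCV Haar cascade finds a face; the timeout value and camera index are assumptions, and the make-up, weather, and health functions are left as comments rather than implementations.

```python
import time
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
cam = cv2.VideoCapture(0)
mode, last_seen = 'display', 0.0
NO_FACE_TIMEOUT = 10.0  # seconds without a face before reverting to display-mode

while True:
    ok, frame = cam.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    now = time.time()
    if len(faces) > 0:
        last_seen = now
        mode = 'mirror'   # user present: offer make-up/weather/health services
    elif mode == 'mirror' and now - last_seen > NO_FACE_TIMEOUT:
        mode = 'display'  # no user for a while: show photos/paintings instead
```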

Design of Model to Recognize Emotional States in a Speech

  • Kim Yi-Gon; Bae Young-Chul
    • International Journal of Fuzzy Logic and Intelligent Systems / v.6 no.1 / pp.27-32 / 2006
  • Verbal communication is the most commonly used means of communication. A spoken word carries a lot of information about speakers and their emotional states. In this paper, we designed a model to recognize emotional states in speech, the first of two phases in developing a toy machine that recognizes emotional states in speech. We conducted an experiment to extract and analyze the emotional state of a speaker in relation to speech. To analyze the signal output, we used three characteristics of sound as vector inputs: the frequency, intensity, and period of tones. We also made use of eight basic emotional parameters: surprise, anger, sadness, expectancy, acceptance, joy, hate, and fear, which were portrayed by five selected students. To facilitate the differentiation of spectral features, we used wavelet transform analysis. We applied ANFIS (Adaptive Neuro-Fuzzy Inference System) in designing an emotion recognition model from speech. In our findings, the inference error was about 10%. The results of our experiment reveal that the applied model is about 85% effective and reliable.
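
To make the pipeline concrete, the sketch below extracts per-band wavelet energies from a speech frame as the input vector. No standard Python library implements ANFIS, so a small MLP classifier stands in for the neuro-fuzzy inference stage; the wavelet family, decomposition level, and placeholder labels are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
import pywt
from sklearn.neural_network import MLPClassifier

EMOTIONS = ['surprise', 'anger', 'sadness', 'expectancy',
            'acceptance', 'joy', 'hate', 'fear']

def wavelet_energies(signal, wavelet='db4', level=4):
    """Log energy of each wavelet sub-band of one speech frame."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    return np.array([np.log(np.sum(c ** 2) + 1e-10) for c in coeffs])

# Dummy frames and labels; real use would segment recorded utterances.
rng = np.random.default_rng(0)
X = np.stack([wavelet_energies(rng.normal(size=2048)) for _ in range(160)])
y = rng.integers(0, len(EMOTIONS), size=160)          # placeholder labels
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(X, y)
print(EMOTIONS[clf.predict(X[:1])[0]])
```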

Design and Implementation of Data Acquisition and Storage Systems for Multi-view Points Sign Language (다시점 수어 데이터 획득 및 저장 시스템 설계 및 구현)

  • Kim, Geunmo; Kim, Bongjae
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.22 no.3 / pp.63-68 / 2022
  • According to the 2021 Disability Statistics Annual Report by the Korea Institute for the Development of Disabled Persons, there are 395,789 people with hearing impairment in Korea. These people experience a lot of inconvenience due to hearing impairment, and many studies on the recognition and translation of Korean Sign Language are being conducted to solve this problem. In sign language recognition and translation research, collecting sign language data is difficult because few people use sign language professionally. In addition, most of the existing data is sign language data captured from the front of the signer. To solve this problem, in this paper we design and implement a storage system that can collect sign language data from multiple viewpoints in real time, rather than from a single viewpoint, and store and manage it with high usability.
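
A minimal sketch of multi-view acquisition in the spirit of this system: grab frames from several cameras in the same loop iteration and store each under a shared timestamp so the views can be aligned later. The camera indices, frame count, and directory layout are assumptions, not the authors' design.

```python
import os
import time
import cv2

CAMERA_IDS = [0, 1, 2]          # e.g., front, left, and right views of the signer
OUT_DIR = 'sign_capture'
caps = [cv2.VideoCapture(i) for i in CAMERA_IDS]
os.makedirs(OUT_DIR, exist_ok=True)

for _ in range(300):            # roughly 10 s of capture at 30 fps
    stamp = f'{time.time():.3f}'
    # grab() on every camera first, then retrieve(), to keep views close in time
    for cap in caps:
        cap.grab()
    for cam_id, cap in zip(CAMERA_IDS, caps):
        ok, frame = cap.retrieve()
        if ok:
            cv2.imwrite(os.path.join(OUT_DIR, f'{stamp}_cam{cam_id}.jpg'), frame)

for cap in caps:
    cap.release()
```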