• Title/Summary/Keyword: Voice Recognition Technique

An Annotation Browsing Technique in e-book for Reading-disabled People Using Voice Recognition (독서장애인 전자책을 위한 음성인식을 이용한 어노테이션 브라우징 기법)

  • Park, Joo-Hyun;Lee, Jong-Woo;Lim, Soon-Bum
    • Proceedings of the Korean Information Science Society Conference / 2012.06c / pp.403-405 / 2012
  • This study proposes a technique for browsing and playing back e-book annotations for reading-disabled people, which we call a voice annotation browsing system. The proposed system consists of a command-input stage, an importance analysis and recommendation stage, a search stage, and an output stage. Because the target users are reading-disabled people who depend heavily on hearing, voice recognition is provided at every stage so that the system can be operated entirely by ear. To verify the efficiency of the proposed voice annotation browsing system, we designed and implemented e-book software and the browsing system running in an Android environment.

Design and Implementation of a Language Identification System for Handwriting Input Data (필기 입력데이터에 대한 언어식별 시스템의 설계 및 구현)

  • Lim, Chae-Gyun;Kim, Kyu-Ho;Lee, Ki-Young
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.10 no.1 / pp.63-68 / 2010
  • Recently, to accelerate the ubiquitous generation, input interfaces for mobile devices have been actively researched. In addition to existing interfaces such as the keyboard and cursor (mouse), subdivisions including handwriting, voice, vision, and touch are under research as new interfaces. Especially for small mobile devices, there is an increasing need for an efficient input interface despite the small screen, because the installation of additional devices is strictly limited by size. Previous studies on handwriting recognition have generally been based either on two-dimensional images or on algorithms that identify handwritten data inserted through vectors, and they have focused only on enhancing the accuracy of the recognition algorithms. However, a problem arises when actual handwriting is entered: the user must first select the class of their characters (e.g., upper- or lower-case English, Hangul - the Korean alphabet - or numbers). To solve this problem, the current study presents a system that distinguishes languages by analyzing the form and shape of the inserted handwritten characters. The proposed technique treats the handwritten data as sets of vector units; by analyzing the correlation and directivity of the vector units, a more efficient language-distinguishing system is made possible.
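The directivity analysis of stroke vectors can be illustrated with a minimal sketch: each successive pen segment is quantized into one of 8 direction bins, giving a profile a script classifier could consume. The bin count and feature design here are illustrative assumptions, not the paper's exact feature set.

```python
import math

def stroke_direction_histogram(points):
    """Quantize the direction of each successive stroke segment of a
    handwritten stroke (list of (x, y) points) into one of 8 bins,
    producing a simple directivity feature vector."""
    bins = [0] * 8
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        # Angle of the segment, mapped into [0, 2*pi)
        angle = math.atan2(y1 - y0, x1 - x0) % (2 * math.pi)
        bins[int(angle // (math.pi / 4)) % 8] += 1
    return bins
```

A horizontal stroke fills the 0-degree bin, while a vertical one fills the 90-degree bin, so scripts with different dominant stroke directions yield distinguishable histograms.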

Preprocessing Technique for Improvement of Speech Recognition in a Car (차량에서의 음성인식율 향상을 위한 전처리 기법)

  • Kim, Hyun-Tae;Park, Jang-Sik
    • The Journal of the Korea Contents Association / v.9 no.1 / pp.139-146 / 2009
  • This paper addresses a modified spectral subtraction scheme suitable for speech recognition in low signal-to-noise ratio (SNR) noisy environments, such as an automatic speech recognition (ASR) system in a car. Conventional spectral subtraction schemes rely on the SNR: attenuation is imposed on the part of the spectrum that appears to have low SNR, and accentuation on the part with high SNR. While such a postulation is adequate in high-SNR environments, it is grossly inadequate in low-SNR scenarios such as the car environment. The proposed method targets low-SNR noisy environments specifically by using a weighting function that enhances the speech-dominant region of the speech spectrum. Experimental results using voice commands for cars show the superior performance of the proposed method over conventional methods.
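The SNR-dependent attenuation idea can be sketched as bin-wise spectral subtraction in which low-SNR bins are subtracted more aggressively. The over-subtraction rule, threshold, and spectral floor below are illustrative assumptions, not the paper's actual weighting function.

```python
def spectral_subtract(noisy_mag, noise_mag, floor=0.01):
    """Bin-wise spectral subtraction with an SNR-dependent weight:
    low-SNR bins are attenuated more aggressively, while
    speech-dominant (high-SNR) bins are left nearly untouched."""
    enhanced = []
    for s, n in zip(noisy_mag, noise_mag):
        snr = s / n if n > 0 else float("inf")
        alpha = 1.0 if snr > 2.0 else 2.0  # over-subtract in low-SNR bins
        e = s - alpha * n
        enhanced.append(max(e, floor * s))  # spectral floor limits musical noise
    return enhanced
```

With a speech-dominant bin (magnitude 10 against noise 1) the bin survives nearly intact, while a noise-dominated bin is pushed down to the floor.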

Spontaneous Speech Emotion Recognition Based On Spectrogram With Convolutional Neural Network (CNN 기반 스펙트로그램을 이용한 자유발화 음성감정인식)

  • Guiyoung Son;Soonil Kwon
    • The Transactions of the Korea Information Processing Society / v.13 no.6 / pp.284-290 / 2024
  • Speech emotion recognition (SER) is a technique used to analyze the speaker's voice patterns, including vibration, intensity, and tone, to determine their emotional state. Interest in artificial intelligence (AI) techniques has increased, and they are now widely used in medicine, education, industry, and the military. Nevertheless, existing researchers have attained impressive results by utilizing acted-out speech from skilled actors in a controlled environment for various scenarios. In particular, there is a mismatch between acted and spontaneous speech, since acted speech includes more explicit emotional expressions than spontaneous speech. For this reason, spontaneous speech emotion recognition remains a challenging task. This paper aims to conduct emotion recognition and improve performance using spontaneous speech data. To this end, we implement deep learning-based speech emotion recognition using the VGG (Visual Geometry Group) network after converting 1-dimensional audio signals into 2-dimensional spectrogram images. The experimental evaluations are performed on the Korean spontaneous emotional speech database from AI-Hub, consisting of 7 emotions: joy, love, anger, fear, sadness, surprise, and neutral. As a result, using a time-frequency 2-dimensional spectrogram, we achieved average accuracies of 83.5% for adults and 73.0% for young people. In conclusion, our findings demonstrate that the suggested framework outperforms current state-of-the-art techniques for spontaneous speech and shows promising performance despite the difficulty of quantifying spontaneous emotional expression.
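The 1-D-to-2-D conversion step can be sketched with a plain framed DFT: overlapping frames of the waveform are transformed and their magnitudes stacked into a time-frequency array. The frame length, hop, and lack of windowing here are illustrative simplifications; the abstract does not specify the actual front end.

```python
import cmath

def spectrogram(signal, frame_len=4, hop=2):
    """Turn a 1-D signal into a 2-D time-frequency magnitude array by
    taking a DFT of overlapping frames (no windowing, for brevity)."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        mags = []
        for k in range(frame_len // 2 + 1):  # keep non-negative frequencies
            coeff = sum(x * cmath.exp(-2j * cmath.pi * k * n / frame_len)
                        for n, x in enumerate(frame))
            mags.append(abs(coeff))
        frames.append(mags)
    return frames  # shape: (num_frames, num_freq_bins)
```

The resulting 2-D array is what a CNN such as VGG consumes as an image-like input.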

A Korean Multi-speaker Text-to-Speech System Using d-vector (d-vector를 이용한 한국어 다화자 TTS 시스템)

  • Kim, Kwang Hyeon;Kwon, Chul Hong
    • The Journal of the Convergence on Culture Technology / v.8 no.3 / pp.469-475 / 2022
  • Training a deep learning-based single-speaker TTS model requires a speech DB of tens of hours and a long training time, which makes it inefficient in time and cost to train multi-speaker or personalized TTS models. The voice cloning method instead uses a speaker encoder model to build the TTS model of a new speaker: through the trained speaker encoder, a speaker embedding vector representing the timbre of the new speaker is created from a small amount of the new speaker's speech that was not used for training. In this paper, we propose a multi-speaker TTS system to which voice cloning is applied. The proposed system consists of a speaker encoder, a synthesizer, and a vocoder. The speaker encoder applies the d-vector technique used in the speaker recognition field, and the timbre of the new speaker is expressed by feeding the d-vector derived from the trained speaker encoder to the synthesizer as an additional input. Experimental results from MOS and timbre-similarity listening tests show that the proposed TTS system performs excellently.
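The role of the d-vector can be illustrated with a small sketch: speaker embeddings are compared by cosine similarity, and the synthesizer is conditioned on a speaker by attaching the embedding to its input features. The concatenation point is an assumption for illustration; the abstract does not fix the exact fusion scheme.

```python
import math

def cosine_similarity(a, b):
    """d-vectors are compared by cosine similarity: embeddings of the
    same speaker should score close to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def condition_synthesizer_input(text_features, d_vector):
    """Condition the synthesizer on a speaker by concatenating the
    d-vector to each frame of text-encoder features (a common scheme,
    assumed here for illustration)."""
    return [frame + d_vector for frame in text_features]
```

Because the d-vector is just an extra input, a new speaker needs only enough audio to compute one embedding, not a full retraining run.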

A Study on Safety Management Improvement System Using Similarity Inspection Technique (유사도검사 기법을 이용한 안전관리 개선시스템 연구)

  • Park, Koo-Rack
    • Journal of the Korea Convergence Society / v.9 no.4 / pp.23-29 / 2018
  • To reduce the accident rate caused by delayed corrective action, which is common on construction sites, this study shortens the time between inspection and correction: a real-time similarity check notifies inspectors of problems as they are found, and the system is modeled so that corrective action can be taken immediately on site, enabling an active response to safety accidents. The results show an opening effect of more than 90% and a safety-accident reduction rate of more than 60%. Building on this system, future work will combine voice recognition and deep learning into a more effective system.

GMM-Based Gender Identification Employing Group Delay (Group Delay를 이용한 GMM기반의 성별 인식 알고리즘)

  • Lee, Kye-Hwan;Lim, Woo-Hyung;Kim, Nam-Soo;Chang, Joon-Hyuk
    • The Journal of the Acoustical Society of Korea / v.26 no.6 / pp.243-249 / 2007
  • We propose an effective voice-based gender identification method using group delay (GD). Generally, features for speech recognition are composed of magnitude information rather than phase information. In our approach, we address the difference between male and female speakers in GD, which is a derivative of the Fourier transform phase. We also propose a novel feature-fusion scheme based on a combination of GD and magnitude information such as mel-frequency cepstral coefficients (MFCC), linear predictive coding (LPC) coefficients, reflection coefficients, and formants. The experimental results indicate that GD is effective in discriminating gender, and that performance improves significantly when the proposed feature-fusion technique is applied.
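Group delay itself is straightforward to compute from a spectrum: it is the negative derivative of the Fourier phase with respect to frequency, approximated below by a finite difference over the unwrapped phase. This is a minimal sketch of the GD feature only, not the paper's full GMM pipeline.

```python
import cmath
import math

def group_delay(spectrum):
    """Negative finite-difference derivative of the unwrapped Fourier
    phase with respect to frequency bin index."""
    phases = [cmath.phase(c) for c in spectrum]
    # Unwrap: keep successive phase differences within (-pi, pi]
    unwrapped = [phases[0]]
    for p in phases[1:]:
        d = p - unwrapped[-1]
        d -= 2 * math.pi * round(d / (2 * math.pi))
        unwrapped.append(unwrapped[-1] + d)
    return [-(unwrapped[i + 1] - unwrapped[i])
            for i in range(len(unwrapped) - 1)]
```

As a sanity check, a pure delay of m samples has phase -2*pi*k*m/N, so its group delay is constant across bins.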

The Conference Management System Architecture for Ontological Knowledge (지식의 온톨로지화를 위한 관리 시스템 아키텍처)

  • Hong, Hyun-Woo;Koh, Gwang-san;Kim, Chang-Soo;Jeong, Jae-Gil;Jung, Hoe-kyung
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / v.9 no.2 / pp.1115-1118 / 2005
  • With the development of Internet technology, on-line conference systems have been produced, and they are now evolving to use pattern recognition and voice recognition. Compared with off-line conferences, on-line conferences are free from distance limitations, but they have unavoidable weak points: as in off-line conferences, the structure and consistency of the content are weak while the conference is in progress, so participants cannot grasp the flow of the conference. Therefore, in this paper we introduce the ontology concept and design a new architecture that uses ontology-mining techniques to make conference content and conference knowledge ontological. To verify the new architecture, we design and implement a knowledge-based conference management system.

A SURVEY ON THE PARENTAL PREFERENCE ON PEDIATRIC DENTIST AND THEIR BEHAVIOR MANAGEMENT TECHNIQUE (소아치과 의사와 행동조절방법에 대한 보호자의 선호도 조사)

  • Park, Soo-Jin;Jung, Tae-Sung;Kim, Shin
    • Journal of the Korean Academy of Pediatric Dentistry / v.29 no.2 / pp.204-209 / 2002
  • The purpose of this survey was to investigate parental recognition of and preference for pediatric dentists and their behavior management techniques. The subjects were the parents of new child patients visiting the Department of Pediatric Dentistry, Pusan National University Hospital, over 6 months. The questionnaire was administered twice: at the first visit and 1 month later. Parental preferences regarding the pediatric dentist (sex, gown color, and glasses-wearing) and behavior management techniques (parental separation, oral sedation, voice control, and physical restraint) were asked through the questionnaire, and the results were as follows: 1. No preference regarding the dentist's sex was shown. 2. The parents saw little relation between glasses-wearing and children's anxiety level, but showed various opinions on gown color. 3. Most parents opposed separation from their children in the operatory. 4. Parents generally accepted the behavior management techniques. 5. There was no significant difference between the first and second surveys.

A study on combination of loss functions for effective mask-based speech enhancement in noisy environments (잡음 환경에 효과적인 마스크 기반 음성 향상을 위한 손실함수 조합에 관한 연구)

  • Jung, Jaehee;Kim, Wooil
    • The Journal of the Acoustical Society of Korea / v.40 no.3 / pp.234-240 / 2021
  • In this paper, mask-based speech enhancement is improved for effective speech recognition in noisy environments. In mask-based speech enhancement, the enhanced spectrum is obtained by multiplying the noisy speech spectrum by a mask. The VoiceFilter (VF) model is used for mask estimation, and the Spectrogram Inpainting (SI) technique is used to remove residual noise from the enhanced spectrum. We propose a combined loss to further improve speech enhancement: to effectively remove residual noise in the speech, the positive part of the triplet loss is used together with the component loss. For the experiments, the TIMIT database is reconstructed using NOISEX92 noise and background-music samples under various Signal-to-Noise Ratio (SNR) conditions. Source-to-Distortion Ratio (SDR), Perceptual Evaluation of Speech Quality (PESQ), and Short-Time Objective Intelligibility (STOI) are used as performance metrics. When the VF model was trained with the mean squared error and the SI model was trained with the combined loss, SDR, PESQ, and STOI improved by 0.5, 0.06, and 0.002, respectively, compared to the system trained only with the mean squared error.
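The loss combination can be sketched as a spectrum-level component loss (per-bin MSE) plus the anchor-positive term of a triplet loss, which pulls the enhanced spectrum toward the clean target. The weight and the use of Euclidean distance for the positive term are illustrative assumptions, not the paper's exact formulation.

```python
def combined_loss(enhanced, clean, weight=1.0):
    """Component loss (per-bin MSE between enhanced and clean spectra)
    plus the positive part of a triplet loss, i.e. the anchor-positive
    Euclidean distance between the enhanced and clean spectra."""
    component = sum((e - c) ** 2 for e, c in zip(enhanced, clean)) / len(clean)
    triplet_positive = sum((e - c) ** 2
                           for e, c in zip(enhanced, clean)) ** 0.5
    return component + weight * triplet_positive
```

Dropping the negative term of the triplet loss leaves only the attraction toward the clean anchor, which matches the stated goal of suppressing residual noise in the enhanced spectrum.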