• Title/Summary/Keyword: Audio Data

Noise Robust Automatic Speech Recognition Scheme with Histogram of Oriented Gradient Features

  • Park, Taejin;Beack, SeungKwan;Lee, Taejin
    • IEIE Transactions on Smart Processing and Computing / v.3 no.5 / pp.259-266 / 2014
  • In this paper, we propose a novel technique for noise-robust automatic speech recognition (ASR). The development of ASR techniques has made it possible to recognize isolated words with a near-perfect word recognition rate. However, in a highly noisy environment, a distinct mismatch between the training speech and the test data results in a significantly degraded word recognition rate (WRA). Unlike conventional ASR systems employing Mel-frequency cepstral coefficients (MFCCs) and a hidden Markov model (HMM), this study applies histogram of oriented gradient (HOG) features and a support vector machine (SVM) to ASR tasks to overcome this problem. Our proposed ASR system is less vulnerable to external interference noise and achieves a higher WRA than a conventional ASR system equipped with MFCCs and an HMM. The performance of our proposed ASR system was evaluated using a phonetically balanced word (PBW) set mixed with artificially added noise.
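
The HOG idea mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the features are computed on a 2-D time-frequency patch (rows as frequency bins, columns as time frames), which the abstract does not state. The function name `hog_cell`, the 9-bin default, and the unsigned-orientation convention are illustrative choices, not the authors' implementation, and the SVM stage is omitted.

```python
import math

def hog_cell(patch, n_bins=9):
    """Histogram of unsigned gradient orientations for one 2-D cell.

    patch: list of rows, e.g. a small log-spectrogram block (assumed layout:
    rows = frequency bins, columns = time frames).
    Returns an L2-normalised histogram with n_bins entries.
    """
    rows, cols = len(patch), len(patch[0])
    hist = [0.0] * n_bins
    bin_width = math.pi / n_bins
    # Use interior points only so central differences are always defined.
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = patch[r][c + 1] - patch[r][c - 1]  # difference along columns
            gy = patch[r + 1][c] - patch[r - 1][c]  # difference along rows
            mag = math.hypot(gx, gy)
            if mag == 0.0:
                continue
            angle = math.atan2(gy, gx) % math.pi    # unsigned orientation
            idx = min(int(angle / bin_width), n_bins - 1)
            hist[idx] += mag                        # magnitude-weighted vote
    norm = math.sqrt(sum(h * h for h in hist))
    return [h / norm for h in hist] if norm > 0 else hist
```

In a full system, histograms from many cells would be concatenated into one feature vector per utterance before being passed to a classifier.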

The Development of Virtual Reality Telemedicine System for Treatment of Acrophobia (고소공포증 치료를 위한 가상현실 원격진료 시스템의 개발)

  • Ryu Jong Hyun;Beack Seung Hwa;Paek Seung Eun;Hong Sung Chan
    • The Transactions of the Korean Institute of Electrical Engineers D / v.52 no.4 / pp.252-257 / 2003
  • Acrophobia is an abnormal fear of heights. It has mainly been treated with medication or cognitive-behavioral methods. Recently, virtual reality technology has been applied to this kind of anxiety disorder. A virtual environment presents the patient with stimuli that arouse the phobia, and exposure to that environment helps the patient build the ability to overcome the fear. With a telemedicine system, a patient can also be diagnosed by a doctor at a distance. The hospital and doctors can receive medical data, audio, video, and signals from the actual examination room or operating room via a live interactive system. Such a system requires an audio-visual and multimedia conference service, an online questionnaire, an ECG signal transfer system, and an update system. The telemedicine system also includes a virtual reality simulation system composed of a position sensor, a head-mounted display, and an audio system. In this study, we applied this system to an acrophobia patient at a distance.

Content-based Music Retrieval by TIP-indexing Techniques and Features of Audio files (TIP-인덱싱 기법과 오디오 화일의 특징계수에 의한 내용기반 음악 검색)

  • Kim Young-In
    • Journal of Korea Society of Industrial Information Systems / v.11 no.3 / pp.10-14 / 2006
  • To effectively manage a very large amount of music data, we need an indexing technique based on audio features. However, indexing techniques for audio features have not yet been studied thoroughly. In this paper, we describe a content-based music information retrieval technique for audio features using the TIP-indexing file. In addition, we develop TIP-indexing files and experiment with various blocking factors to present performance comparisons for effective indexing. Experimental results show the effectiveness of the proposed techniques.

A Lossless and Lossy Audio Compression using Prediction Model and Wavelet Transform

  • Park, Se-Yil;Park, Se-Hyoung;Lim, Dae-Sik;Jaeho Shin
    • Proceedings of the IEEK Conference / 2002.07c / pp.2063-2066 / 2002
  • In this paper, we propose a structure for a lossless audio coding method. A prediction model is used in the wavelet transform domain. After the DWT, the wavelet coefficients are quantized and decorrelated by prediction modeling. The DWT can be constructed to match critical bands. This gives a lower data rate representation of the audio signal with quality comparable to that of perceptual coding. The prediction errors are then efficiently coded by the Golomb coding method. The prediction coefficients are fixed to reduce the computational burden of finding them.
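
The entropy-coding step mentioned in the abstract above can be sketched as follows. This is a minimal illustration that uses Rice coding, the special case of Golomb coding in which the divisor is a power of two (2**k). The zigzag mapping of signed residuals, the bit-string output, and all function names are assumptions made for illustration; the abstract does not describe the authors' exact scheme.

```python
def zigzag(e):
    """Map a signed integer to a non-negative one: 0,-1,1,-2,2 -> 0,1,2,3,4."""
    return 2 * e if e >= 0 else -2 * e - 1

def unzigzag(u):
    """Inverse of zigzag."""
    return u // 2 if u % 2 == 0 else -(u + 1) // 2

def rice_encode(residuals, k):
    """Encode signed integers as a bit string with Rice parameter k."""
    bits = []
    for e in residuals:
        u = zigzag(e)
        q, r = u >> k, u & ((1 << k) - 1)
        bits.append("1" * q + "0")                    # quotient in unary, ended by 0
        if k:
            bits.append(format(r, "0{}b".format(k)))  # remainder in k binary digits
    return "".join(bits)

def rice_decode(bits, k, count):
    """Decode `count` signed integers from a bit string produced by rice_encode."""
    out, pos = [], 0
    for _ in range(count):
        q = 0
        while bits[pos] == "1":
            q += 1
            pos += 1
        pos += 1                                      # skip the terminating 0
        r = int(bits[pos:pos + k], 2) if k else 0
        pos += k
        out.append(unzigzag((q << k) | r))
    return out
```

Small residuals produce short codewords, which is why this family of codes suits prediction errors that cluster around zero.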

MPEG-4 BIFS Optimization for Interactive T-DMB Content (지상파 DMB 컨텐츠의 MPEG-4 BIFS 최적화 기법)

  • Cha, Kyung-Ae
    • Journal of Korea Society of Industrial Information Systems / v.12 no.1 / pp.54-60 / 2007
  • The Digital Multimedia Broadcasting (DMB) system was developed to offer high-quality multimedia content in the mobile environment. The system adopts the MPEG-4 standard for the main video, audio, and other media formats. To provide interactive content, it also adopts the MPEG-4 scene description, which specifies the spatio-temporal layout and behaviors of individual objects. The more interactive the content, the higher the bitrate the scene description needs. However, the bandwidth available for metadata such as the scene description is limited in the mobile environment. Moreover, the DMB terminal renders each media stream according to the scene description, so the binary format for scenes (BIFS) stream corresponding to the scene description must be decoded and parsed before the media data is presented. Consequently, a transmission delay of the BIFS stream delays the whole audio-visual scene presentation, even when the audio or video streams are encoded at a very low bitrate. This paper presents an effective optimization technique that adapts the BIFS stream to the expected bitrate without wasting bandwidth and avoids transmission delays in the initial scene description for interactive DMB content.

Classification of Pornographic Videos Using Audio Information (오디오 신호를 이용한 음란 동영상 판별)

  • Kim, Bong-Wan;Choi, Dae-Lim;Bang, Man-Won;Lee, Yong-Ju
    • Proceedings of the KSPS conference / 2007.05a / pp.207-210 / 2007
  • As the Internet has become prevalent in our lives, harmful content on the Internet has been increasing and has become a very serious problem. Pornographic video in particular is harmful to children. To address this, many filtering systems have been built on keyword-based or image-based methods. The main purpose of this paper is to devise a system that classifies pornographic videos based on audio information. As the feature vector, we use Mel-frequency cepstral coefficients (MFCC) together with Mel-Cepstrum Modulation Energy (MCME), which is the modulation energy calculated on the time trajectory of the MFCC; as the classifier, we use a Gaussian Mixture Model (GMM). In the experiments, the proposed system correctly classified 97.5% of the pornographic data and 99.5% of the non-pornographic data. We expect that the proposed method can be used as a component of a more accurate classification system that uses video and audio information simultaneously.
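
The GMM-based decision described in the abstract above can be sketched generically: each class has its own mixture model, and a clip is assigned to the class whose model gives the higher total log-likelihood over its feature frames. The diagonal-covariance assumption, the function names, and the parameter layout `(weights, means, variances)` are illustrative assumptions. Feature extraction (MFCC, MCME) and model training are not shown.

```python
import math

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of one feature vector under a diagonal-covariance GMM."""
    comp = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xi, m, v in zip(x, mu, var):
            ll += -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        comp.append(ll)
    top = max(comp)
    # log-sum-exp over mixture components for numerical stability
    return top + math.log(sum(math.exp(c - top) for c in comp))

def classify(frames, models):
    """Return the label whose GMM gives the highest total log-likelihood.

    frames: list of feature vectors for one clip.
    models: dict mapping label -> (weights, means, variances).
    """
    scores = {label: sum(gmm_loglik(f, *params) for f in frames)
              for label, params in models.items()}
    return max(scores, key=scores.get)
```

Summing frame log-likelihoods treats frames as independent, a common simplification in GMM-based audio classification.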

Automatic melody extraction algorithm using a convolutional neural network

  • Lee, Jongseol;Jang, Dalwon;Yoon, Kyoungro
    • KSII Transactions on Internet and Information Systems (TIIS) / v.11 no.12 / pp.6038-6053 / 2017
  • In this study, we propose an automatic melody extraction algorithm using deep learning. In this algorithm, feature images generated from the energy of frequency bands are extracted from polyphonic audio files, and a deep learning technique, a convolutional neural network (CNN), is applied to the feature images. In the training data, each short frame of polyphonic music is labeled with a musical note, and a CNN-based classifier is trained to determine the pitch value of a short frame of the audio signal. We aim to build a novel structure for melody extraction; the proposed algorithm therefore has a simple structure and, instead of using various signal processing techniques, uses only a CNN to find the melody in polyphonic audio. Despite this simple structure, promising results were obtained in the experiments. Compared with state-of-the-art algorithms, the proposed algorithm did not give the best result, but the results were comparable, and we believe they could be improved with appropriate training data. In this paper, melody extraction and the proposed algorithm are introduced first, and the proposed algorithm is then explained in detail. Finally, we present our experiments and a comparison of the results.

Development of AVN Software Using Vehicle Information for Hand Gesture (차량정보 분석과 제스처 인식을 위한 AVN 소프트웨어 구현)

  • Oh, Gyu-tae;Park, Inhye;Lee, Sang-yub;Ko, Jae-jin
    • The Journal of Korean Institute of Communications and Information Sciences / v.42 no.4 / pp.892-898 / 2017
  • This paper describes the development of AVN (Audio Video Navigation) software for vehicle information analysis and gesture recognition. In the designed software, the module that examines the vehicle's CAN (Controller Area Network) data analyzes the driving state. Using the classified information, the AVN software merges vehicle information with hand gesture information. The derived data is then used to match the service step and to perform the service. The designed AVN software was implemented on a hardware platform commonly used in vehicles. We confirmed the operation of the vehicle analysis module and gesture recognition in a simulated environment similar to the real world.

Design and Implementation of a Multimedia Retrieval System (멀티미디어 검색 시스템의 설계 및 구현)

  • 노승민;황인준
    • Journal of KIISE:Databases / v.30 no.5 / pp.494-506 / 2003
  • Recently, the explosive popularity of multimedia information has triggered the need to retrieve multimedia content efficiently from databases that include audio, video, and images. In this paper, we propose an XML-based retrieval scheme and a data model that complement the weak aspects of annotation-based and content-based retrieval methods. The property and hierarchy structure of image and video data are represented and manipulated based on the Multimedia Description Schema (MDS), which conforms to the MPEG-7 standard. For audio content, pitch contours extracted from the acoustic features are converted into a UDR string. In particular, to improve retrieval performance, the user's access pattern and frequency are used in the construction of an index. We have implemented a prototype system and evaluated its performance through various experiments.
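
The pitch-contour conversion mentioned in the abstract above can be sketched under one assumption: that "UDR" stands for Up, Down, and Repeat, with one symbol per pair of consecutive pitch values. The abstract does not define the acronym. The function name `udr_string` and the `tolerance` parameter (the smallest pitch change that counts as movement) are hypothetical.

```python
def udr_string(pitches, tolerance=0.5):
    """Encode a pitch sequence as U (up), D (down), R (repeat) symbols.

    pitches: sequence of pitch values, e.g. MIDI note numbers.
    tolerance: changes of at most this size are treated as a repeat.
    """
    symbols = []
    for prev, cur in zip(pitches, pitches[1:]):
        diff = cur - prev
        if diff > tolerance:
            symbols.append("U")
        elif diff < -tolerance:
            symbols.append("D")
        else:
            symbols.append("R")
    return "".join(symbols)
```

A string representation like this lets a melody query be matched against stored melodies with ordinary string-matching or indexing techniques, independent of key.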

Prediction of Closed Quotient During Vocal Phonation using GRU-type Neural Network with Audio Signals

  • Hyeonbin Han;Keun Young Lee;Seong-Yoon Shin;Yoseup Kim;Gwanghyun Jo;Jihoon Park;Young-Min Kim
    • Journal of information and communication convergence engineering / v.22 no.2 / pp.145-152 / 2024
  • Closed quotient (CQ) represents the time ratio for which the vocal folds remain in contact during voice production. Because analyzing CQ values serves as an important reference point in vocal training for professional singers, these values have been measured mechanically or electrically by either inverse filtering of airflows captured by a circumferentially vented mask or post-processing of electroglottography waveforms. In this study, we introduced a novel algorithm to predict the CQ values only from audio signals. This has eliminated the need for mechanical or electrical measurement techniques. Our algorithm is based on a gated recurrent unit (GRU)-type neural network. To enhance the efficiency, we pre-processed an audio signal using the pitch feature extraction algorithm. Then, GRU-type neural networks were employed to extract the features. This was followed by a dense layer for the final prediction. The Results section reports the mean square error between the predicted and real CQ. It shows the capability of the proposed algorithm to predict CQ values.
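
The two quantities named in the abstract above can be written out as a short sketch. It assumes that CQ is computed per glottal cycle as contact time divided by cycle period, following the abstract's definition as a time ratio, and that the error metric is the standard mean square error. The function names are illustrative, and the GRU model itself is not shown.

```python
def closed_quotient(contact_time, period):
    """Fraction of one glottal cycle during which the vocal folds are in contact."""
    if period <= 0:
        raise ValueError("period must be positive")
    return contact_time / period

def mean_square_error(predicted, actual):
    """Mean of squared differences between predicted and reference CQ values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)
```

For example, a contact time of 2 ms in a 5 ms cycle gives a CQ of 0.4.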