• Title/Summary/Keyword: Mel-spectrogram

Search Result 41, Processing Time 0.021 seconds

Attention Modules for Improving Cough Detection Performance based on Mel-Spectrogram (사전 학습된 딥러닝 모델의 Mel-Spectrogram 기반 기침 탐지를 위한 Attention 기법에 따른 성능 분석)

  • Changjoon Park;Inki Kim;Beomjun Kim;Younghoon Jeon;Jeonghwan Gwak
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2023.01a
    • /
    • pp.43-46
    • /
    • 2023
  • 호흡기 관련 전염병의 주된 증상인 기침은 공기 중에 감염된 병원균을 퍼트리며 비감염자가 해당 병원균에 노출된 경우 높은 확률로 해당 전염병에 감염될 위험이 있다. 또한 사람들이 많이 모이는 공공장소 및 실내 공간에서의 기침 탐지 및 조치는 전염병의 대규모 유행을 예방할 수 있는 효율적인 방법이다. 따라서 본 논문에서는 탐지해야 하는 기침 소리 및 일상생활 속 발생할 수 있는 기침과 유사한 배경 소리 들을 Mel-Spectrogram으로 변환한 후 시각화된 특징을 CNN 모델에 학습시켜 기침 탐지를 진행하며, 일반적으로 사용되는 사전 학습된 CNN 모델에 제안된 Attention 모듈의 적용이 기침 탐지 성능 향상에 도움이 됨을 입증하였다.

  • PDF

Comparison of environmental sound classification performance of convolutional neural networks according to audio preprocessing methods (오디오 전처리 방법에 따른 콘벌루션 신경망의 환경음 분류 성능 비교)

  • Oh, Wongeun
    • The Journal of the Acoustical Society of Korea
    • /
    • v.39 no.3
    • /
    • pp.143-149
    • /
    • 2020
  • This paper presents the effect of the feature extraction methods used in the audio preprocessing on the classification performance of the Convolutional Neural Networks (CNN). We extract mel spectrogram, log mel spectrogram, Mel Frequency Cepstral Coefficient (MFCC), and delta MFCC from the UrbanSound8K dataset, which is widely used in environmental sound classification studies. Then we scale the data to 3 distributions. Using the data, we test four CNNs, VGG16, and MobileNetV2 networks for performance assessment according to the audio features and scaling. The highest recognition rate is achieved when using the unscaled log mel spectrum as the audio features. Although this result is not appropriate for all audio recognition problems but is useful for classifying the environmental sounds included in the Urbansound8K.

Environmental Sound Classification for Selective Noise Cancellation in Industrial Sites (산업현장에서의 선택적 소음 제거를 위한 환경 사운드 분류 기술)

  • Choi, Hyunkook;Kim, Sangmin;Park, Hochong
    • Journal of Broadcast Engineering
    • /
    • v.25 no.6
    • /
    • pp.845-853
    • /
    • 2020
  • In this paper, we propose a method for classifying environmental sound for selective noise cancellation in industrial sites. Noise in industrial sites causes hearing loss in workers, and researches on noise cancellation have been widely conducted. However, the conventional methods have a problem of blocking all sounds and cannot provide the optimal operation per noise type because of common cancellation method for all types of noise. In order to perform selective noise cancellation, therefore, we propose a method for environmental sound classification based on deep learning. The proposed method uses new sets of acoustic features consisting of temporal and statistical properties of Mel-spectrogram, which can overcome the limitation of Mel-spectrogram features, and uses convolutional neural network as a classifier. We apply the proposed method to five-class sound classification with three noise classes and two non-noise classes. We confirm that the proposed method provides improved classification accuracy by 6.6% point, compared with that using conventional Mel-spectrogram features.

Multi-Emotion Recognition Model with Text and Speech Ensemble (텍스트와 음성의 앙상블을 통한 다중 감정인식 모델)

  • Yi, Moung Ho;Lim, Myoung Jin;Shin, Ju Hyun
    • Smart Media Journal
    • /
    • v.11 no.8
    • /
    • pp.65-72
    • /
    • 2022
  • Due to COVID-19, the importance of non-face-to-face counseling is increasing as the face-to-face counseling method has progressed to non-face-to-face counseling. The advantage of non-face-to-face counseling is that it can be consulted online anytime, anywhere and is safe from COVID-19. However, it is difficult to understand the client's mind because it is difficult to communicate with non-verbal expressions. Therefore, it is important to recognize emotions by accurately analyzing text and voice in order to understand the client's mind well during non-face-to-face counseling. Therefore, in this paper, text data is vectorized using FastText after separating consonants, and voice data is vectorized by extracting features using Log Mel Spectrogram and MFCC respectively. We propose a multi-emotion recognition model that recognizes five emotions using vectorized data using an LSTM model. Multi-emotion recognition is calculated using RMSE. As a result of the experiment, the RMSE of the proposed model was 0.2174, which was the lowest error compared to the model using text and voice data, respectively.

A General Acoustic Drone Detection Using Noise Reduction Preprocessing (환경 소음 제거를 통한 범용적인 드론 음향 탐지 구현)

  • Kang, Hae Young;Lee, Kyung-ho
    • Journal of the Korea Institute of Information Security & Cryptology
    • /
    • v.32 no.5
    • /
    • pp.881-890
    • /
    • 2022
  • As individual and group users actively use drones, the risks (Intrusion, Information leakage, and Sircraft crashes and so on) in no-fly zones are also increasing. Therefore, it is necessary to build a system that can detect drones intruding into the no-fly zone. General acoustic drone detection researches do not derive location-independent performance by directly learning drone sound including environmental noise in a deep learning model to overcome environmental noise. In this paper, we propose a drone detection system that collects sounds including environmental noise, and detects drones by removing noise from target sound. After removing environmental noise from the collected sound, the proposed system predicts the drone sound using Mel spectrogram and CNN deep learning. As a result, It is confirmed that the drone detection performance, which was weak due to unstudied environmental noises, can be improved by more than 7%.

A Novel Approach to COVID-19 Diagnosis Based on Mel Spectrogram Features and Artificial Intelligence Techniques

  • Alfaidi, Aseel;Alshahrani, Abdullah;Aljohani, Maha
    • International Journal of Computer Science & Network Security
    • /
    • v.22 no.9
    • /
    • pp.195-207
    • /
    • 2022
  • COVID-19 has remained one of the most serious health crises in recent history, resulting in the tragic loss of lives and significant economic impacts on the entire world. The difficulty of controlling COVID-19 poses a threat to the global health sector. Considering that Artificial Intelligence (AI) has contributed to improving research methods and solving problems facing diverse fields of study, AI algorithms have also proven effective in disease detection and early diagnosis. Specifically, acoustic features offer a promising prospect for the early detection of respiratory diseases. Motivated by these observations, this study conceptualized a speech-based diagnostic model to aid in COVID-19 diagnosis. The proposed methodology uses speech signals from confirmed positive and negative cases of COVID-19 to extract features through the pre-trained Visual Geometry Group (VGG-16) model based on Mel spectrogram images. This is used in addition to the K-means algorithm that determines effective features, followed by a Genetic Algorithm-Support Vector Machine (GA-SVM) classifier to classify cases. The experimental findings indicate the proposed methodology's capability to classify COVID-19 and NOT COVID-19 of varying ages and speaking different languages, as demonstrated in the simulations. The proposed methodology depends on deep features, followed by the dimension reduction technique for features to detect COVID-19. As a result, it produces better and more consistent performance than handcrafted features used in previous studies.

Text-to-speech with linear spectrogram prediction for quality and speed improvement (음질 및 속도 향상을 위한 선형 스펙트로그램 활용 Text-to-speech)

  • Yoon, Hyebin
    • Phonetics and Speech Sciences
    • /
    • v.13 no.3
    • /
    • pp.71-78
    • /
    • 2021
  • Most neural-network-based speech synthesis models utilize neural vocoders to convert mel-scaled spectrograms into high-quality, human-like voices. However, neural vocoders combined with mel-scaled spectrogram prediction models demand considerable computer memory and time during the training phase and are subject to slow inference speeds in an environment where GPU is not used. This problem does not arise in linear spectrogram prediction models, as they do not use neural vocoders, but these models suffer from low voice quality. As a solution, this paper proposes a Tacotron 2 and Transformer-based linear spectrogram prediction model that produces high-quality speech and does not use neural vocoders. Experiments suggest that this model can serve as the foundation of a high-quality text-to-speech model with fast inference speed.

Performance change of defect classification model of rotating machinery according to noise addition and denoising process (노이즈 추가와 디노이징 처리에 따른 회전 기계설비의 결함 분류 모델 성능 변화)

  • Se-Hoon Lee;Sung-Soo Kim;Bi-gun Cho
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2023.07a
    • /
    • pp.1-2
    • /
    • 2023
  • 본 연구는 환경 요인이 통제되어 있는 실험실 데이터에 산업 현장에서 발생하는 유사 잡음을 노이즈로 추가하였을 때, SNR비에 따른 노이즈별 STFT Log Spectrogram, Mel-Spectrogram, CWT Spectrogram 총 3가지의 이미지를 생성하고, 각 이미지를 입력으로 한 CNN 결함 분류 모델의 성능 결과를 확인하였다. 원본 데이터의 영향력이 큰 0db 이상의 SNR비로 합성할 경우 원본 데이터와 분류 결과상 큰 차이가 존재하지 않았으며, 노이즈 데이터의 영향이 큰 0db 이하의 SNR비로 합성할 경우, -20db의 STFT 이미지 기준 약 26%의 성능 저하가 발생하였다. 또한, Wiener Filtering을 통한 디노이징 처리 이후, 노이즈를 효과적으로 제거하여 분류 성능의 결과가 높아지는 점을 확인하였다.

  • PDF

Implementation of Cough Detection System Using IoT Sensor in Respirator

  • Shin, Woochang
    • International journal of advanced smart convergence
    • /
    • v.9 no.4
    • /
    • pp.132-138
    • /
    • 2020
  • Worldwide, the number of corona virus disease 2019 (COVID-19) confirmed cases is rapidly increasing. Although vaccines and treatments for COVID-19 are being developed, the disease is unlikely to disappear completely. By attaching a smart sensor to the respirator worn by medical staff, Internet of Things (IoT) technology and artificial intelligence (AI) technology can be used to automatically detect the medical staff's infection symptoms. In the case of medical staff showing symptoms of the disease, appropriate medical treatment can be provided to protect the staff from the greater risk. In this study, we design and develop a system that detects cough, a typical symptom of respiratory infectious diseases, by applying IoT technology and artificial technology to respiratory protection. Because the cough sound is distorted within the respirator, it is difficult to guarantee accuracy in the AI model learned from the general cough sound. Therefore, coughing and non-coughing sounds were recorded using a sensor attached to a respirator, and AI models were trained and performance evaluated with this data. Mel-spectrogram conversion method was used to efficiently classify sound data, and the developed cough recognition system had a sensitivity of 95.12% and a specificity of 100%, and an overall accuracy of 97.94%.

Comparison of Korean Real-time Text-to-Speech Technology Based on Deep Learning (딥러닝 기반 한국어 실시간 TTS 기술 비교)

  • Kwon, Chul Hong
    • The Journal of the Convergence on Culture Technology
    • /
    • v.7 no.1
    • /
    • pp.640-645
    • /
    • 2021
  • The deep learning based end-to-end TTS system consists of Text2Mel module that generates spectrogram from text, and vocoder module that synthesizes speech signals from spectrogram. Recently, by applying deep learning technology to the TTS system the intelligibility and naturalness of the synthesized speech is as improved as human vocalization. However, it has the disadvantage that the inference speed for synthesizing speech is very slow compared to the conventional method. The inference speed can be improved by applying the non-autoregressive method which can generate speech samples in parallel independent of previously generated samples. In this paper, we introduce FastSpeech, FastSpeech 2, and FastPitch as Text2Mel technology, and Parallel WaveGAN, Multi-band MelGAN, and WaveGlow as vocoder technology applying non-autoregressive method. And we implement them to verify whether it can be processed in real time. Experimental results show that by the obtained RTF all the presented methods are sufficiently capable of real-time processing. And it can be seen that the size of the learned model is about tens to hundreds of megabytes except WaveGlow, and it can be applied to the embedded environment where the memory is limited.