A study on deep neural speech enhancement in drone noise environment

  • 김지민 (Dept. of Computer Science and Engineering, Incheon National University) ;
  • 정재희 (Dept. of Computer Science and Engineering, Incheon National University) ;
  • 여찬은 (Dept. of Computer Science and Engineering, Incheon National University) ;
  • 김우일 (Dept. of Computer Science and Engineering, Incheon National University)
  • Received : 2022.03.23
  • Accepted : 2022.05.18
  • Published : 2022.05.31

Abstract

In this paper, actual drone noise samples are collected for speech processing in disaster environments and used to build a noise-corrupted speech database, on which speech enhancement performance is evaluated for spectral subtraction and deep neural network-based mask estimation. To improve the performance of VoiceFilter (VF), an existing deep neural network-based speech enhancement model, we apply a Self-Attention operation and feed the estimated noise information to the attention model as an additional input. Compared to the existing VF model, the proposed method improves Source to Distortion Ratio (SDR), Perceptual Evaluation of Speech Quality (PESQ), and Short-Time Objective Intelligibility (STOI) by 3.77%, 1.66%, and 0.32%, respectively. When the model is trained on a data set in which 75% of the noise-corrupted speech uses drone noise collected from the Internet, the average relative performance drops for SDR, PESQ, and STOI are 3.18%, 2.79%, and 0.96%, respectively, compared to training with only the recorded drone noise. This confirms that, in environments where real data is difficult to obtain, data similar to the real data can be collected and used effectively to train speech enhancement models.
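The abstract compares a classical spectral subtraction baseline with deep mask-based enhancement. As a rough illustration of the baseline only (not the authors' implementation), the sketch below applies magnitude spectral subtraction to a drone-noise-corrupted utterance; the file name, sampling rate, and the assumption that the first 0.5 s of the recording contains only drone noise are illustrative.

```python
# Minimal spectral-subtraction sketch (illustrative, not the paper's system).
# Assumes a mono 16 kHz noisy WAV whose first 0.5 s is speech-free drone noise,
# used to estimate the noise magnitude spectrum.
import numpy as np
import librosa
import soundfile as sf

def spectral_subtraction(noisy, sr=16000, n_fft=512, hop=128,
                         noise_sec=0.5, floor=0.02):
    # STFT of the noisy signal
    stft = librosa.stft(noisy, n_fft=n_fft, hop_length=hop)
    mag, phase = np.abs(stft), np.angle(stft)

    # Noise magnitude estimated from the assumed speech-free leading frames
    noise_frames = max(1, int(noise_sec * sr / hop))
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)

    # Subtract the noise estimate; spectral floor limits musical noise
    enhanced_mag = np.maximum(mag - noise_mag, floor * noise_mag)

    # Reconstruct with the noisy phase
    return librosa.istft(enhanced_mag * np.exp(1j * phase),
                         hop_length=hop, length=len(noisy))

noisy, sr = librosa.load("drone_noisy_speech.wav", sr=16000)  # hypothetical input
sf.write("enhanced.wav", spectral_subtraction(noisy, sr), sr)
```

The enhanced waveform can then be scored against the clean reference with SDR, PESQ, and STOI, the metrics reported above.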

Acknowledgement

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT) (No. 2019R1F1A106299513).
