A study on speech enhancement using complex-valued spectrum employing Feature map Dependent attention gate

  • Jaehee Jung (Department of Computer Science and Engineering, Incheon National University) ;
  • Wooil Kim (Department of Computer Science and Engineering, Incheon National University)
  • Received : 2023.08.08
  • Accepted : 2023.09.08
  • Published : 2023.11.30

Abstract

Speech enhancement, used to improve the perceptual quality and intelligibility of noisy speech, has progressed from methods based on the magnitude spectrum to methods based on the complex-valued spectrum, which can enhance both magnitude and phase. In this paper, we study how to apply an attention mechanism to a complex-valued spectrum-based speech enhancement system to further improve the intelligibility and quality of noisy speech. The attention is based on additive attention and computes attention weights that take the complex-valued spectrum into account. In addition, global average pooling is used to reflect the importance of each feature map. Complex-valued spectrum-based speech enhancement is performed with the Deep Complex U-Net (DCUNET) model, and the additive attention follows the method proposed in the Attention U-Net model. Experiments on noisy speech in a living-room environment show that the proposed method outperforms the baseline model on the Source-to-Distortion Ratio (SDR), Perceptual Evaluation of Speech Quality (PESQ), and Short-Time Objective Intelligibility (STOI) metrics, and it yields consistent improvements across various background-noise environments and low Signal-to-Noise Ratio (SNR) conditions. These results demonstrate that the proposed speech enhancement system effectively improves the intelligibility and quality of noisy speech.

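The attention gate described in the abstract can be sketched roughly as follows. This is a minimal PyTorch illustration under stated assumptions, not the authors' implementation: the module names, tensor shapes, and the way real and imaginary parts are stacked are assumptions made for the example. It combines Attention U-Net-style additive attention over the time-frequency plane with a global-average-pooling branch that assigns an importance weight to each feature map, and applies the result to a complex-valued skip connection as would occur inside a DCUNET-like encoder-decoder.

```python
# Hypothetical sketch of a feature-map-dependent attention gate for
# complex-valued feature maps. Illustration only; shapes and naming are
# assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn


class ComplexFeatureMapAttentionGate(nn.Module):
    def __init__(self, skip_ch: int, gate_ch: int, inter_ch: int):
        super().__init__()
        # Real and imaginary parts are stacked along the channel axis,
        # so every branch sees 2 * channels real-valued feature maps.
        self.theta = nn.Conv2d(2 * skip_ch, inter_ch, kernel_size=1)  # skip projection
        self.phi = nn.Conv2d(2 * gate_ch, inter_ch, kernel_size=1)    # gate projection
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)              # attention map
        self.fc = nn.Linear(2 * skip_ch, 2 * skip_ch)                 # feature-map importance

    def forward(self, skip: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
        # skip: complex tensor (B, C_skip, F, T) from the encoder path
        # gate: complex tensor (B, C_gate, F, T) from the decoder path
        x = torch.cat([skip.real, skip.imag], dim=1)   # (B, 2*C_skip, F, T)
        g = torch.cat([gate.real, gate.imag], dim=1)   # (B, 2*C_gate, F, T)

        # Additive attention over time-frequency positions (Attention U-Net style).
        att = torch.sigmoid(self.psi(torch.relu(self.theta(x) + self.phi(g))))  # (B, 1, F, T)

        # Per-feature-map importance from global average pooling.
        fm = torch.sigmoid(self.fc(x.mean(dim=(2, 3))))  # (B, 2*C_skip)
        fm = fm[:, :, None, None]                        # broadcast over (F, T)

        # Gate the skip connection and reassemble a complex tensor.
        y = x * att * fm
        real, imag = torch.chunk(y, 2, dim=1)
        return torch.complex(real, imag)


# Example usage with arbitrary sizes:
if __name__ == "__main__":
    gate_module = ComplexFeatureMapAttentionGate(skip_ch=32, gate_ch=32, inter_ch=16)
    enc_feat = torch.randn(1, 32, 257, 100, dtype=torch.cfloat)
    dec_feat = torch.randn(1, 32, 257, 100, dtype=torch.cfloat)
    out = gate_module(enc_feat, dec_feat)
    print(out.shape, out.dtype)  # torch.Size([1, 32, 257, 100]) torch.complex64
```

In a DCUNET-like model this kind of gate would sit on each skip connection between encoder and decoder; the sketch omits the resolution matching (upsampling of the gating signal) that Attention U-Net applies when encoder and decoder feature maps differ in size.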

Acknowledgement

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT) (NRF-2021R1F1A1063347).
