
Performance comparison evaluation of speech enhancement using various loss functions

  • Seorim Hwang (School of Computer and Telecommunications Engineering, Yonsei University) ;
  • Joon Byun (School of Computer and Telecommunications Engineering, Yonsei University) ;
  • Young-Cheol Park (School of Computer and Telecommunications Engineering, Yonsei University)
  • Received : 2021.01.19
  • Accepted : 2021.02.23
  • Published : 2021.03.31

Abstract

This paper evaluates and compares the performance of Deep Neural Network (DNN)-based speech enhancement models trained with various loss functions. As the baseline model, we used a complex network that can take the phase information of speech into account. As loss functions, we consider two basic loss functions, the Mean Squared Error (MSE) and the Scale-Invariant Source-to-Noise Ratio (SI-SNR), and two perceptual-based loss functions, the Perceptual Metric for Speech Quality Evaluation (PMSQE) and the Log Mel Spectra (LMS). The performance comparison was carried out through objective evaluations and listening tests on outputs obtained with various combinations of the loss functions. Test results show that combining a perceptual-based loss function with MSE or SI-SNR improves the overall performance, and that the perceptual-based loss functions, even when they yield lower objective scores, achieve better performance in the listening tests.
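
To make the losses concrete, below is a minimal PyTorch sketch (not the authors' implementation) of the SI-SNR loss and of a combined objective that adds a weighted perceptual term. The helper name `perceptual_loss` and the weight `alpha` are illustrative assumptions standing in for a PMSQE or LMS term, not values from the paper.

```python
import torch

def si_snr_loss(estimate: torch.Tensor, target: torch.Tensor,
                eps: float = 1e-8) -> torch.Tensor:
    """Negative Scale-Invariant SNR, averaged over the batch.

    estimate, target: (batch, num_samples) time-domain waveforms.
    """
    # Zero-mean both signals so the measure ignores any DC offset.
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target to get the scale-invariant
    # target component; the residual is treated as noise.
    dot = torch.sum(estimate * target, dim=-1, keepdim=True)
    target_energy = torch.sum(target ** 2, dim=-1, keepdim=True) + eps
    s_target = dot * target / target_energy
    e_noise = estimate - s_target
    ratio = torch.sum(s_target ** 2, dim=-1) / (torch.sum(e_noise ** 2, dim=-1) + eps)
    si_snr = 10.0 * torch.log10(ratio + eps)
    # Training minimizes the loss, so negate the (to-be-maximized) SI-SNR.
    return -si_snr.mean()

def combined_loss(estimate, target, perceptual_loss, alpha=0.5):
    """Base loss plus a weighted perceptual term (e.g., PMSQE or LMS).

    `alpha` is a hypothetical weight, not a value taken from the paper.
    """
    return si_snr_loss(estimate, target) + alpha * perceptual_loss(estimate, target)

if __name__ == "__main__":
    x_hat = torch.randn(4, 16000)  # enhanced speech (batch of 1 s clips at 16 kHz)
    x = torch.randn(4, 16000)      # clean reference speech
    mse = lambda a, b: torch.mean((a - b) ** 2)  # simple stand-in for the perceptual term
    print(si_snr_loss(x_hat, x).item())
    print(combined_loss(x_hat, x, mse).item())
```

In practice the perceptual term would be PMSQE computed on power spectra or an MSE over log-mel spectrograms (LMS); both are differentiable, so they can be backpropagated together with the base loss.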

Keywords
