Multi-level Skip Connection for Nested U-Net-based Speech Enhancement


  • Seorim Hwang (Department of Computer Science, Yonsei University) ;
  • Joon Byun (Department of Computer Science, Yonsei University) ;
  • Junyeong Heo (Department of Computer Science, Yonsei University) ;
  • Jaebin Cha (Division of Computer and Telecommunications Engineering, Yonsei University) ;
  • Youngcheol Park (Division of Software, Yonsei University)
  • Received : 2022.08.01
  • Accepted : 2022.09.26
  • Published : 2022.11.30

Abstract

In deep neural network (DNN)-based speech enhancement, exploiting both the global and local information of the input speech is closely tied to model performance. Recently, a nested U-Net structure that captures global and local information of the input data at multiple scales has been proposed. This nested U-Net has also been applied to speech enhancement, where it showed outstanding performance. However, the single skip connection used in nested U-Nets must be adapted to the nested structure. In this paper, we propose a multi-level skip connection (MLS) to optimize the performance of the nested U-Net-based speech enhancement algorithm. Compared with the standard skip connection, the proposed MLS yielded clear improvements on various objective evaluation metrics, confirming that MLS can optimize the performance of the nested U-Net-based speech enhancement algorithm. In addition, the final proposed model outperformed other DNN-based speech enhancement models.
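The core idea of a multi-level skip connection — routing encoder features from several depths, rather than only the same-depth feature map, into a decoder stage — can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the function names, the nearest-neighbor resampling, and the toy channel/length sizes are all illustrative assumptions, and a real model would use learned convolutions on 2-D spectrogram features.

```python
import numpy as np

def nearest_resize(x, target_len):
    """Nearest-neighbor resampling of a (channels, length) feature map
    along the time axis to a target length (illustrative stand-in for
    the learned up/down-sampling a real U-Net decoder would use)."""
    idx = np.arange(target_len) * x.shape[1] // target_len
    return x[:, idx]

def multi_level_skip(encoder_feats, target_len):
    """Toy multi-level skip connection (MLS): features from every
    encoder depth are brought to the decoder stage's resolution and
    concatenated along the channel axis, instead of forwarding only
    the single same-depth feature map."""
    resized = [nearest_resize(f, target_len) for f in encoder_feats]
    return np.concatenate(resized, axis=0)

# Hypothetical encoder features at three depths of a toy 1-D U-Net:
feats = [np.random.randn(4, 64),   # shallow: 4 channels, length 64
         np.random.randn(8, 32),   # middle:  8 channels, length 32
         np.random.randn(16, 16)]  # deep:   16 channels, length 16

skip = multi_level_skip(feats, target_len=64)
print(skip.shape)  # (28, 64): 4 + 8 + 16 channels at the decoder resolution
```

The decoder stage thus sees coarse (global) and fine (local) context simultaneously, which is the property the abstract attributes to the multi-scale nested structure.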


