A study on skip-connection with time-frequency self-attention for improving speech enhancement based on complex-valued spectrum

  • Jaehee Jung (Department of Computer Science and Engineering, Incheon National University) ;
  • Wooil Kim (Department of Computer Science and Engineering, Incheon National University)
  • Received : 2023.01.25
  • Accepted : 2023.03.08
  • Published : 2023.03.31

Abstract

Deep neural networks composed of an encoder and a decoder, such as the U-Net widely used for speech enhancement, pass encoder features to the decoder through skip connections. The skip connection helps the decoder reconstruct the enhanced spectrum and compensates for information lost through the encoder. However, the encoder features and decoder features joined by a skip connection carry different meanings. In this paper, for complex-valued spectrum based speech enhancement, Self-Attention (SA) is applied to the skip connection so that the encoder features are transformed to be compatible with the decoder features. SA is a technique in which, when generating an output sequence in a sequence-to-sequence task, a weighted average over the input sequence is used to attend to its most relevant parts; previous work has shown that it effectively removes noise when applied to speech enhancement. Three models that use the encoder and decoder features in different ways to apply SA to the skip connection are studied. Experimental results on the TIMIT database show that the proposed methods improve on all evaluation metrics over the Deep Complex U-Net (DCUNET) connected by plain skip connections only.
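
To make the mechanism concrete, below is a minimal real-valued PyTorch sketch of one way SA could transform encoder features on a skip connection, with the decoder features supplying the query. This is not the authors' implementation: the paper's DCUNET-based models operate on complex-valued spectra, and the module name SkipAttention, the query/key/value split, and all shapes are illustrative assumptions.

```python
# Minimal sketch, not the paper's code: real-valued self-attention on a
# U-Net skip connection. SkipAttention and all shapes are assumptions.
import torch
import torch.nn as nn

class SkipAttention(nn.Module):
    """Decoder features form the query; encoder features form the key and
    value; attention runs over all time-frequency positions."""
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, kernel_size=1)  # query from decoder
        self.k = nn.Conv2d(channels, channels, kernel_size=1)  # key from encoder
        self.v = nn.Conv2d(channels, channels, kernel_size=1)  # value from encoder

    def forward(self, enc, dec):
        # enc, dec: (batch, channels, freq, time) maps of equal shape
        b, c, f, t = enc.shape
        q = self.q(dec).flatten(2).transpose(1, 2)  # (b, f*t, c)
        k = self.k(enc).flatten(2)                  # (b, c, f*t)
        v = self.v(enc).flatten(2).transpose(1, 2)  # (b, f*t, c)
        # Scaled dot-product attention: a weighted average of encoder values
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)        # (b, f*t, f*t)
        out = (attn @ v).transpose(1, 2).reshape(b, c, f, t)  # back to a 2-D map
        # Concatenate with the decoder features, as a plain skip connection would
        return torch.cat([out, dec], dim=1)

# Toy usage: 32-channel features, 16 frequency bins, 25 time frames
enc = torch.randn(2, 32, 16, 25)
dec = torch.randn(2, 32, 16, 25)
print(SkipAttention(32)(enc, dec).shape)  # torch.Size([2, 64, 16, 25])
```

The three variants the abstract mentions would differ in how the encoder and decoder features are assigned to the query, key, and value; the assignment above is only one plausible choice.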

Acknowledgement

This work was supported by the Incheon National University Research Grant in 2018.

References

  1. J. Lim and A. Oppenheim, "All-pole modeling of degraded speech," IEEE Trans. on Acoustics, Speech, and Signal Process. 26, 197-210 (1978). https://doi.org/10.1109/TASSP.1978.1163086
  2. R. Martin, "Spectral subtraction based on minimum statistics," Proc. EUSIPCO, 1182-1185 (1994).
  3. D. L. Wang and J. Chen, "Supervised speech separation based on deep learning: An overview," IEEE/ACM Trans. on Audio, Speech, and Lang. Process. 26, 1702-1726 (2018). https://doi.org/10.1109/TASLP.2018.2842159
  4. H. S. Choi, J. H. Kim, J. Huh, A. Kim, J. W. Ha, and K. Lee, "Phase-aware speech enhancement with deep complex U-Net," Proc. ICLR, 1-20 (2019).
  5. C. Trabelsi, O. Bilaniuk, Y. Zhang, D. Serdyuk, S. Subramanian, J. F. Santos, S. Mehri, N. Rostamzadeh, Y. Bengio, and C. J. Pal, "Deep complex networks," Proc. ICLR, 1-19 (2018).
  6. K. Paliwal, K. Wojcicki, and B. Shannon, "The importance of phase in speech enhancement," Speech Communication 53, 465-494 (2011).
  7. Y. Wang and D. L. Wang, "A deep neural network for time-domain signal reconstruction," Proc. IEEE ICASSP, 4390-4394 (2015).
  8. H. Wang, X. Zhang, and D. L. Wang, "Attention-based fusion for bone-conducted and air-conducted speech enhancement in the complex domain," Proc. IEEE ICASSP, 7757-7761 (2022).
  9. S. Zhao, B. Ma, K. N. Watcharasupat, and W. S. Gan, "FRCRN: Boosting feature representation using frequency recurrence for monaural speech enhancement," Proc. IEEE ICASSP, 9281-9285 (2022).
  10. V. Kothapally and J. H. Hansen, "Complex-valued time-frequency self-attention for speech dereverberation," Proc. Interspeech, 2543-2547 (2022).
  11. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," Proc. NIPS, 6000-6010 (2017).
  12. C. Tang, C. Luo, Z. Zhao, W. Xie, and W. Zeng, "Joint time-frequency and time domain learning for speech enhancement," Proc. 29th IJCAI, 3816-3822 (2021).
  13. V. Kothapally, W. Xia, S. Ghorbani, J. H. Hansen, W. Xue, and J. Huang, "SkipConvNet: Skip convolutional neural network for speech dereverberation using optimally smoothed spectral mapping," Proc. Interspeech, 3935-3939 (2020).
  14. M. Drozdzal, E. Vorontsov, G. Chartrand, S. Kadoury, and C. Pal, "The importance of skip connections in biomedical image segmentation," Proc. DLMIA, 179-187 (2016).
  15. T. Tong, G. Li, X. Liu, and Q. Gao, "Image super-resolution using dense skip connections," Proc. IEEE ICCV, 4799-4807 (2017).
  16. Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation," IEEE/ACM Trans. on Audio, Speech, and Lang. Process. 27, 1256-1266 (2019). https://doi.org/10.1109/TASLP.2019.2915167
  17. J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, "Acoustic-phonetic continuous speech corpus CD-ROM NIST speech disc 1-1.1," DARPA TIMIT, NIST Interagency/Internal Rep. (NISTIR) 4930, 1993.
  18. E. Vincent, R. Gribonval, and C. Fevotte, "Performance measurement in blind audio source separation," IEEE Trans. on Audio, Speech, and Lang. Process. 14, 1462-1469 (2006). https://doi.org/10.1109/TSA.2005.858005
  19. A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs," Proc. IEEE ICASSP, 749-752 (2001).
  20. C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "A short-time objective intelligibility measure for time-frequency weighted noisy speech," Proc. IEEE ICASSP, 4214-4217 (2010).