Speech extraction based on AuxIVA with weighted source variance and noise dependence for robust speech recognition

  • Ui-Hyeob Shin (Department of Electronic Engineering, Sogang University);
  • Hyung-Min Park (Department of Electronic Engineering, Sogang University)
  • Received : 2022.03.21
  • Accepted : 2022.05.11
  • Published : 2022.05.31

Abstract

In this paper, we propose a speech enhancement algorithm as a pre-processing step for robust speech recognition in noisy environments. Auxiliary-function-based Independent Vector Analysis (AuxIVA) is performed with a weighted covariance matrix whose time-varying variances are scaled by target masks representing the time-frequency contributions of the target speech. The mask estimates can be obtained from a Neural Network (NN) pre-trained for speech extraction, or from diffuseness computed with the Coherence-to-Diffuse power Ratio (CDR) to identify the direct-sound components of the target speech. In addition, the outputs for omni-directional noise are closely coupled by sharing their time-varying variances, similarly to independent subspace analysis or IVA. The AuxIVA-based speech extraction method can also be performed in the Independent Low-Rank Matrix Analysis (ILRMA) framework by extending the Non-negative Matrix Factorization (NMF) model for the noise outputs to Non-negative Tensor Factorization (NTF), which maintains the inter-channel dependence among the noise output channels. Experimental results on the CHiME-4 dataset demonstrate the effectiveness of the presented algorithms.
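The mask-weighted AuxIVA idea in the abstract can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: demixing filters are updated by standard iterative projection, and the target channel's time-varying variance is scaled by a given mask. The function name, the particular variance weighting, and the omission of projection-back scale correction are all assumptions for illustration.

```python
import numpy as np

def auxiva_masked(X, mask, n_iter=20, eps=1e-8):
    """AuxIVA with mask-weighted source variances (sketch).

    X    : (F, T, M) complex STFT of M microphone channels
    mask : (F, T) target mask in [0, 1] (e.g. from a NN or CDR diffuseness)
    Returns separated STFT Y of shape (F, T, M); channel 0 plays the target role.
    """
    F, T, M = X.shape
    W = np.tile(np.eye(M, dtype=complex), (F, 1, 1))   # (F, M, M) demixing filters

    for _ in range(n_iter):
        Y = np.einsum('fkm,ftm->ftk', W, X)            # current source estimates
        # time-varying variance estimates; target channel scaled by the mask
        r = np.sqrt(np.sum(np.abs(Y) ** 2, axis=0)) + eps   # (T, M)
        r[:, 0] *= np.mean(mask, axis=0) + eps         # weight target variance
        phi = 1.0 / r                                  # (T, M) auxiliary weights
        for k in range(M):
            # weighted covariance V_k[f] = (1/T) sum_t phi_k(t) x[f,t] x[f,t]^H
            V = np.einsum('t,ftm,ftn->fmn', phi[:, k], X, X.conj()) / T
            WV = W @ V
            ek = np.tile(np.eye(M)[:, k], (F, 1))[:, :, None]  # (F, M, 1)
            w = np.linalg.solve(WV, ek)[:, :, 0]       # w_k = (W V_k)^{-1} e_k
            norm = np.sqrt(np.einsum('fm,fmn,fn->f',
                                     w.conj(), V, w).real) + eps
            W[:, k, :] = (w / norm[:, None]).conj()    # row k of W is w_k^H
    return np.einsum('fkm,ftm->ftk', W, X)
```

A real system would additionally resolve the scaling ambiguity (e.g. projection back onto a reference microphone) before resynthesis.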

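The CDR-based diffuseness mask mentioned in the abstract can likewise be sketched for a two-microphone pair. This is a hedged illustration using the DOA-independent CDR estimator from the dereverberation literature (Schwarz and Kellermann, 2015); the smoothing constant, the microphone spacing, and the final mapping from diffuseness to a mask are assumptions for illustration.

```python
import numpy as np

def cdr_diffuseness_mask(X1, X2, d=0.05, fs=16000, alpha=0.68, eps=1e-10):
    """Diffuseness-based target mask from a two-microphone STFT pair (sketch).

    X1, X2 : (F, T) complex STFTs of two microphones spaced d metres apart
    Returns a (F, T) mask in [0, 1), high where direct (coherent) sound dominates.
    """
    F, T = X1.shape
    c = 343.0                                      # speed of sound [m/s]
    freqs = np.linspace(0, fs / 2, F)
    # coherence of an ideal spherically isotropic (diffuse) noise field;
    # np.sinc is the normalized sinc, so sin(2*pi*f*d/c)/(2*pi*f*d/c) = sinc(2fd/c)
    gamma_n = np.sinc(2 * freqs * d / c)

    # recursively averaged auto-/cross-power spectra
    P11 = np.zeros((F, T)); P22 = np.zeros((F, T))
    P12 = np.zeros((F, T), dtype=complex)
    p11 = np.full(F, eps); p22 = np.full(F, eps); p12 = np.zeros(F, dtype=complex)
    for t in range(T):
        p11 = alpha * p11 + (1 - alpha) * np.abs(X1[:, t]) ** 2
        p22 = alpha * p22 + (1 - alpha) * np.abs(X2[:, t]) ** 2
        p12 = alpha * p12 + (1 - alpha) * X1[:, t] * X2[:, t].conj()
        P11[:, t], P22[:, t], P12[:, t] = p11, p22, p12

    gamma_x = P12 / (np.sqrt(P11 * P22) + eps)     # estimated coherence
    gx2 = np.minimum(np.abs(gamma_x) ** 2, 1 - eps)
    gn = gamma_n[:, None]
    re = gn * gamma_x.real
    # DOA-independent CDR estimator
    under = re ** 2 - gn ** 2 * gx2 + gn ** 2 - 2 * re + gx2
    cdr = (re - gx2 - np.sqrt(np.maximum(under, 0))) / (gx2 - 1)
    cdr = np.maximum(cdr, 0)
    diffuseness = 1.0 / (cdr + 1.0)
    return 1.0 - diffuseness                       # mask ~ direct-to-total ratio
```

Such a mask could then scale the target variance in the AuxIVA weighting, as the abstract describes.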

Keywords

Acknowledgement

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT) (NRF-2020R1A2B5B01002398).
