Target Speaker Speech Restoration via Spectral Bases Learning

  • Sunho Park (Department of Computer Science and Engineering, POSTECH)
  • Jiho Yoo (Department of Computer Science and Engineering, POSTECH)
  • Seungjin Choi (Department of Computer Science and Engineering, POSTECH)
  • Published: 2009.03.15

Abstract

This paper proposes an algorithm that restores the speech of a specific (target) speaker from stereo microphone recordings made in a real environment with noise and reverberation, under the assumption that training utterances of the target speaker are available. To this end, we combine convolutive blind source separation (CBSS), which separates sources in a reverberant environment, with a post-processing method, yielding a system that removes noise and reverberation from the noisy convolutive mixture and restores only the target speaker's speech. Specifically, using non-negative matrix factorization (NMF), we learn basis vectors that preserve the spectral characteristics of the target speaker from the training utterances, and we propose a two-stage post-processing scheme based on these bases. First, CBSS, the intermediate stage of the system, takes the convolutive mixture as input and outputs independent sources on two channels, and the channel closer to the target speaker's speech is selected automatically (channel selection step). Then the noise and interference source remaining in the selected channel are suppressed so that only the target speaker's speech survives, finally yielding the target speech with noise and reverberation removed (reconstruction step). Since both post-processing steps operate on the basis vectors learned from the target speaker's speech, the speaker-specific spectral information can be exploited efficiently for speech restoration. In this way, the paper presents a method for incorporating prior information about the source into CBSS, improving the separation results of conventional CBSS while restoring only the target speaker's speech. Experiments confirm that the proposed method successfully restores the target speaker's speech in noisy, reverberant environments.
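The basis-learning step described above can be illustrated with standard multiplicative-update NMF on a magnitude spectrogram. The Python sketch below is ours, not the authors' code: the function name `learn_speaker_bases` and all parameter choices (number of bases, iteration count) are assumptions for illustration, and the paper's actual decomposition may differ in its cost function and constraints.

```python
import numpy as np

def learn_speaker_bases(V, n_bases=40, n_iter=200, eps=1e-9):
    """Learn non-negative spectral bases from a magnitude spectrogram
    V (freq_bins x frames) of the target speaker's training utterances,
    using Lee-Seung multiplicative updates for ||V - W H||_F^2."""
    n_freq, n_frames = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((n_freq, n_bases)) + eps    # spectral basis vectors
    H = rng.random((n_bases, n_frames)) + eps  # per-frame activations
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update bases
    # Only the spectral shapes are needed downstream, so normalize
    # each basis vector to unit norm before returning.
    return W / (np.linalg.norm(W, axis=0, keepdims=True) + eps)
```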

This paper proposes a target speech extraction method that restores the speech signal of a target speaker from a noisy convolutive mixture of speech and an interference source. We assume that the target speaker is known and that his/her utterances are available at training time. Incorporating the additional information extracted from the training utterances into the separation, we combine convolutive blind source separation (CBSS) with non-negative decomposition techniques, e.g., a probabilistic latent variable model. The non-negative decomposition is used to learn a set of bases from the spectrogram of the training utterances, where the bases represent the spectral information corresponding to the target speaker. Based on the learned spectral bases, our method provides two post-processing steps for CBSS. The channel selection step finds the desirable output channel of CBSS, i.e., the one that dominantly contains the target speech. The reconstruction step recovers the original spectrogram of the target speech from the selected output channel so that the remaining interference source and background noise are suppressed. Experimental results show that our method substantially improves the separation results of CBSS and, as a result, successfully recovers the target speech.
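Under the same assumptions, the two post-processing steps might look like the sketch below: channel selection keeps the CBSS output that the learned bases explain with the smallest residual, and reconstruction suppresses whatever those bases cannot model via a soft mask. All names here (`fit_activations`, `select_channel`, `reconstruct`) and the masking heuristic are our illustrative choices; the paper's probabilistic latent variable formulation is not reproduced.

```python
import numpy as np

def fit_activations(V, W, n_iter=100, eps=1e-9):
    """Fit non-negative activations H for a fixed basis set W so that
    W @ H approximates the magnitude spectrogram V."""
    H = np.full((W.shape[1], V.shape[1]), 1.0 / W.shape[1])
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H

def select_channel(spectrograms, W):
    """Channel selection: keep the CBSS output whose magnitude
    spectrogram the target-speaker bases W explain best."""
    residuals = [np.linalg.norm(V - W @ fit_activations(V, W))
                 for V in spectrograms]
    return int(np.argmin(residuals))

def reconstruct(V, W, eps=1e-9):
    """Reconstruction: suppress whatever the speaker bases cannot
    model, using a soft (Wiener-like) mask on the magnitudes."""
    target = W @ fit_activations(V, W)
    mask = np.minimum(target / (V + eps), 1.0)
    return mask * V
```

In practice the mask would be applied to the complex STFT of the selected channel and inverted to obtain a time-domain estimate of the target speech.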

