Binary Mask Estimation using Training-based SNR Estimation for Improving Speech Intelligibility

A Binary Mask Estimation Method Using Training-Based Signal-to-Noise Ratio Estimation for Improving Speech Intelligibility

  • Kim, Gibak (School of Electrical Engineering, Soongsil University)
  • Received : 2012.08.30
  • Accepted : 2012.11.14
  • Published : 2012.11.30

Abstract

This paper presents a noise reduction algorithm that applies binary masking in the time-frequency domain to improve speech intelligibility. In binary masking, the noise-corrupted speech is decomposed into time-frequency units. Noise-dominant units are removed by setting the corresponding mask values to 0, and target-dominant units are retained unchanged by setting their mask values to 1. We estimate the binary mask by comparing the local signal-to-noise ratio (SNR) of each unit with a threshold, where the local SNR is estimated by a training-based approach. We also propose an optimal threshold derived from the distribution of the training database. The proposed method is evaluated in listening tests with normal-hearing subjects, and intelligibility scores are computed by counting the number of words correctly recognized.
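
The masking rule in the abstract can be illustrated with a minimal sketch. It assumes the local SNR of every time-frequency unit is already available in dB; the paper obtains it with a training-based estimator, which is not reproduced here. The function names `estimate_binary_mask` and `apply_mask`, the 0 dB threshold, and the toy numbers are illustrative assumptions, not values taken from the paper.

```python
import numpy as np


def estimate_binary_mask(local_snr_db, threshold_db=0.0):
    """Return 1 where the local SNR exceeds the threshold, else 0."""
    local_snr_db = np.asarray(local_snr_db, dtype=float)
    return (local_snr_db > threshold_db).astype(np.uint8)


def apply_mask(noisy_tf, mask):
    """Zero out noise-dominant units; keep target-dominant units unchanged."""
    return np.asarray(noisy_tf) * mask


# Toy example: 2 frequency bands x 3 time frames (made-up numbers).
snr_db = np.array([[5.0, -3.0, 0.5],
                   [-10.0, 2.0, -0.1]])
noisy = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

# Threshold of 0 dB is an assumption chosen only for illustration.
mask = estimate_binary_mask(snr_db, threshold_db=0.0)
masked = apply_mask(noisy, mask)
```

In a real system the masked time-frequency representation would then be resynthesized into a waveform; that step depends on the analysis filterbank and is left out of this sketch.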

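This paper addresses a method for improving the intelligibility of speech in noisy environments using binary masking in the time-frequency domain. The noisy speech signal is decomposed into time-frequency units, and intelligibility can be improved by removing the units in which noise is relatively dominant, assigning them a mask value of "0". Estimating such a binary mask requires estimating the signal-to-noise ratio in each time-frequency unit and comparing it with a threshold; in this paper, a training-based signal-to-noise ratio estimation method is used, and the binary mask is estimated by comparing the estimate with the threshold. For the threshold against which the signal-to-noise ratio is compared, in addition to a fixed threshold that uses the same value for all frequency bands, we propose an optimal threshold whose value is chosen for each frequency band from the distribution of the training data. The proposed binary mask estimation method is applied to noisy data, which is then presented to listeners to measure speech intelligibility.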
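
The difference between a fixed threshold shared by all frequency bands and a band-dependent threshold can be sketched as follows. This is a hedged illustration only: the function name `mask_with_band_thresholds` and all threshold values are hypothetical, and the way the paper derives its optimal per-band values from the training data is not shown here.

```python
import numpy as np


def mask_with_band_thresholds(local_snr_db, band_thresholds_db):
    """Compare each band's local SNR (rows) with that band's threshold.

    band_thresholds_db may be a scalar (fixed threshold for all bands)
    or a 1-D sequence with one value per frequency band.
    """
    snr = np.asarray(local_snr_db, dtype=float)        # shape: (bands, frames)
    thr = np.asarray(band_thresholds_db, dtype=float)  # shape: () or (bands,)
    if thr.ndim == 1:
        if thr.shape[0] != snr.shape[0]:
            raise ValueError("need one threshold per frequency band")
        thr = thr[:, np.newaxis]  # broadcast across time frames
    return (snr > thr).astype(np.uint8)


# Toy local SNR values in dB: 2 bands x 3 frames (made-up numbers).
snr_db = np.array([[-4.0, -7.0, 1.0],
                   [-4.0, 3.0, -1.0]])

# Fixed threshold: the same hypothetical -5 dB for every band.
fixed_mask = mask_with_band_thresholds(snr_db, -5.0)

# Band-dependent thresholds: hypothetical values, one per band.
band_mask = mask_with_band_thresholds(snr_db, [-5.0, 0.0])
```

With these toy values the two masks agree in the first band and differ in the second, which shows how a per-band threshold changes which units are retained.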
