An Adaptation Method in Noise Mismatch Conditions for DNN-based Speech Enhancement

Xu, Si-Ying; Niu, Tong; Qu, Dan; Long, Xing-Yan

doi:10.3837/tiis.2018.10.017

KSII Transactions on Internet and Information Systems (TIIS)

Volume 12 Issue 10 / Pages 4930-4951 / 2018 / 1976-7277 (pISSN) / 1976-7277 (eISSN)

Korean Society for Internet Information (한국인터넷정보학회)

Xu, Si-Ying (National Digital Switching System Engineering & Technological R&D Center) ;
Niu, Tong (National Digital Switching System Engineering & Technological R&D Center) ;
Qu, Dan (National Digital Switching System Engineering & Technological R&D Center) ;
Long, Xing-Yan (National Digital Switching System Engineering & Technological R&D Center)

Received : 2017.12.26
Accepted : 2018.05.03
Published : 2018.10.31

https://doi.org/10.3837/tiis.2018.10.017

Abstract

Deep-learning-based speech enhancement has shown considerable success, but its performance still degrades under mismatch conditions. This paper proposes an adaptation method to improve performance under noise mismatch conditions. First, we devise a noise-aware training scheme that supplies identity vectors (i-vectors) as parallel input features, adapting deep neural network (DNN) acoustic models to the target noise. Second, given a small amount of adaptation data, a noise-dependent DNN is obtained from a noise-independent DNN using $L_2$ regularization, which forces the estimated masks to stay close to those of the unadapted model. Finally, experiments under different noise types and SNR conditions show that the proposed method improves STOI by 0.1%-9.6% and provides consistent improvements in PESQ and segSNR over the baseline systems.
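
The two steps in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `append_ivector` and `adaptation_loss`, the weight `rho`, and the choice to apply the $L_2$ penalty to the distance between adapted and unadapted masks are assumptions made for illustration, since the abstract does not give the exact objective.

```python
def append_ivector(frames, ivector):
    """Noise-aware input: append one utterance-level i-vector to every frame.

    frames  : list of per-frame acoustic feature lists
    ivector : list of floats (hypothetical noise i-vector)
    """
    return [list(frame) + list(ivector) for frame in frames]


def adaptation_loss(adapted, target, unadapted, rho):
    """Assumed adaptation objective over flattened mask values.

    Mean squared error to the target mask, plus an L2 penalty (weight rho)
    that keeps the adapted masks close to the unadapted model's masks.
    """
    n = len(adapted)
    fit = sum((a - t) ** 2 for a, t in zip(adapted, target)) / n
    stay_close = sum((a - u) ** 2 for a, u in zip(adapted, unadapted)) / n
    return fit + rho * stay_close


# Toy usage: 5 frames of 4 features each, with a 3-dimensional i-vector.
frames = [[0.0] * 4 for _ in range(5)]
ivector = [0.0, 1.0, 2.0]
noise_aware_input = append_ivector(frames, ivector)

# Toy masks: the adapted output sits halfway between target and unadapted.
loss = adaptation_loss([0.5] * 4, [1.0] * 4, [0.0] * 4, rho=0.5)
```

With `rho = 0` the objective reduces to plain fine-tuning on the adaptation data; larger values pull the adapted model toward the noise-independent one, which is the usual motivation for regularized adaptation when adaptation data is scarce.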
