http://dx.doi.org/10.3837/tiis.2018.10.017

An Adaptation Method in Noise Mismatch Conditions for DNN-based Speech Enhancement  

Xu, Si-Ying (National Digital Switching System Engineering & Technological R&D Center)
Niu, Tong (National Digital Switching System Engineering & Technological R&D Center)
Qu, Dan (National Digital Switching System Engineering & Technological R&D Center)
Long, Xing-Yan (National Digital Switching System Engineering & Technological R&D Center)
Publication Information
KSII Transactions on Internet and Information Systems (TIIS) / v.12, no.10, 2018, pp. 4930-4951
Abstract
Deep learning based speech enhancement has shown considerable success, but it still suffers performance degradation under mismatch conditions. In this paper, an adaptation method is proposed to improve performance under noise mismatch conditions. First, we propose noise-aware training that supplies identity vectors (i-vectors) as parallel input features to adapt deep neural network (DNN) acoustic models to the target noise. Second, given a small amount of adaptation data, a noise-dependent DNN is obtained from a noise-independent DNN by using $L_2$ regularization, forcing the estimated masks to stay close to those of the unadapted model. Finally, experiments were carried out under different noise and SNR conditions; the proposed method achieved absolute STOI improvements of 0.1%-9.6% and provided consistent gains in PESQ and segSNR over the baseline systems.
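The two adaptation ideas in the abstract can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: `augment_with_ivector` shows the noise-aware input scheme (the utterance-level i-vector appended to every frame's feature vector), and `adaptation_loss` shows an output-regularized objective in which the adapted model fits the adaptation data while an $L_2$ penalty keeps its estimated masks close to those of the noise-independent model. The function names, the NumPy formulation, and the weight `rho` are assumptions for illustration.

```python
import numpy as np

def augment_with_ivector(features, ivector):
    """Noise-aware input: tile the utterance-level i-vector and append it
    to every frame's acoustic feature vector before feeding the DNN.
    features: (frames, dim) array; ivector: (ivec_dim,) array."""
    frames = features.shape[0]
    return np.concatenate([features, np.tile(ivector, (frames, 1))], axis=1)

def adaptation_loss(mask_adapted, mask_target, mask_unadapted, rho=0.1):
    """Regularized adaptation objective: mean-squared error on the
    adaptation data plus an L2 penalty pulling the adapted model's masks
    toward those produced by the unadapted (noise-independent) DNN."""
    fit = np.mean((mask_adapted - mask_target) ** 2)
    reg = np.mean((mask_adapted - mask_unadapted) ** 2)
    return fit + rho * reg
```

With `rho = 0`, the objective reduces to plain fine-tuning on the adaptation data; larger `rho` trades fit for stability, which is the mechanism that guards against overfitting a small adaptation set.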
Keywords
Noise-aware training; identity vector (i-vector); $L_2$ regularization; speech enhancement; DNN; condition mismatch
Citations & Related Records
  • Reference
1 Philipos. C. Loizou, "Speech Enhancement: Theory and Practice," 2nd ed. Boca Raton, FL, USA: CRC Press, Inc., 2013.
2 Steven Boll. "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 27, no. 2, pp. 113-122, 1979.   DOI
3 J. Lim and A. Oppenheim, "All-pole modeling of degraded speech," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 26, no. 3, pp. 197-210, Jun 1978.   DOI
4 Y. Ephraim, "Statistical-model-based speech enhancement systems," Proceedings of the IEEE, vol. 80, no. 10, pp. 1526-1555, Oct 1992.   DOI
5 Kevin W. Wilson, Bhiksha Raj, and Paris Smaragdis, "Regularized non-negative matrix factorization with temporal dependencies for speech denoising," INTERSPEECH, pp. 411-414, 2008.
6 Y. Bengio, "Learning deep architectures for AI," Found. Trends Mach. Learn., vol. 2, no. 1, pp. 1-127, 2009.   DOI
7 Xugang Lu, Yu Tsao, Shigeki Matsuda, and Chiori Hori, "Speech enhancement based on deep denoising autoencoder," INTERSPEECH, pp. 436-440, 2013.
8 Bing-yin Xia and Chang-chun Bao, "Speech enhancement with weighted denoising auto-encoder," INTERSPEECH, pp. 3444-3448, 2013.
9 Bingyin Xia and Changchun Bao, "Wiener filtering based speech enhancement with weighted denoising auto-encoder and noise classification," Speech Communication, vol. 60, pp.13-29, 2014.   DOI
10 D. L. Wang and G. J. Brown, "Computational auditory scene analysis: Principles, algorithms, and applications," Wiley-IEEE Press, 2006.
11 Kim, D.Y., Kwan Un, C., Kim, N.S., "Speech recognition in noisy environments using first-order vector Taylor Series," Speech Communication, vol. 24, no. 1, pp. 39-49, 1998.   DOI
12 Li, J., Deng, L., Yu, D., Gong, Y., Acero, A., "A unified framework of HMM adaptation with joint compensation of additive and convolutive distortions via vector Taylor series," in Proc. of IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 65-70, 2007.
13 Li, J., Deng, L., Yu, D., Gong, Y., Acero, A., "HMM adaptation using a phase-sensitive acoustic distortion model for environment-robust speech recognition," in Proc. of International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4069-4072, 2008.
14 Moreno, P.J., Raj, B., Stern, R.M., "A vector Taylor series approach for environment-independent speech recognition," in Proc. of International Conference on Acoustics, Speech and Signal Processing(ICASSP), pp. 733-736, 1996.
15 Seide, F., Li, G., Chen, X., Yu, D., "Feature engineering in context-dependent deep neural networks for conversational speech transcription," in Proc. of IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 24-29, 2011.
16 Seltzer, M., Yu, D., Wang, Y., "An investigation of deep neural networks for noise robust speech recognition," in Proc. of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
17 Yu, D., Seltzer, M.L., Li, J., Huang, J.T., Seide, F., "Feature learning in deep neural networks-studied on speech recognition tasks," in Proc. of International Conference on Learning Representation (ICLR), 2013.
18 Abdel-Hamid, O., Jiang, H., "Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code," in Proc. of International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7942-7946, 2013.
19 D. K. Kim and N. S. Kim, "Bayesian speaker adaptation based on probabilistic principal component analysis," INTERSPEECH, pp. 734-737, 2000.
20 Shaofei Xue, Ossama Abdel-Hamid, Hui Jiang, Lirong Dai, "Direct adaptation of hybrid DNN/HMM model for fast speaker adaptation in LVCSR based on speaker code," in Proc. of International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6389-6393, 2013.
21 George Saon, Hagen Soltau, David Nahamoo, and Michael Picheny, "Speaker adaptation of neural network acoustic models using i-vectors," in Proc. of IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 55-59, 2013.
22 Dehak, N., Kenny, P., Dehak, R., Dumouchel, P., Ouellet, P., "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 4, pp. 788-798, 2011.   DOI
23 Glembek, O., Burget, L., Matejka, P., Karafiat, M., Kenny, P., "Simplification and optimization of i-vector extraction," in Proc. of International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4515-4519, 2011.
24 Wang D, "On ideal binary mask as the computational goal of auditory scene analysis," Divenyi, P., editor. Speech Separation by Humans and Machines. Norwell, MA, USA: Kluwer: pp.181-197, 2005.
25 Christine De Mol, Ernesto De Vito and Lorenzo Rosasco, "Elastic-net regularization in learning theory," Journal of Complexity, vol. 25, issue 2, pp. 201-230, Apr 2009.   DOI
26 Yu, D., Yao, K., Su, H., Li, G., Seide, F., "KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition," in Proc. of International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7892-7897, 2013.
27 Tibshirani, R., "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B (Statistical Methodology), vol. 58, no. 1, pp. 267-288, 1996.
28 J. Stadermann and G. Rigoll, "Two-stage speaker adaptation of hybrid tied-posterior acoustic models," in Proc. of International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. I, pp. 997-1000, 2005.
29 Li N, Loizou P, "Factors influencing intelligibility of ideal binary masked speech: Implications for noise reduction," The Journal of the Acoustical Society of America, vol. 123, no. 3, pp. 1673-1682, 2008.   DOI
30 Kjems U, Boldt, J, Pedersen M, Lunner T, Wang D, "Role of mask pattern in intelligibility of ideal binary-masked noisy speech," The Journal of the Acoustical Society of America, vol. 126, pp. 1415-1426, 2009.   DOI
31 Yuxuan Wang, Arun Narayanan, Deliang Wang, "On training targets for supervised speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1849-1858, Dec 2014.   DOI
32 J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, pp. 2121-2159, 2011.
33 Hui Zou and Trevor Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society, Series B, vol. 67, no. 2, pp. 301-320, 2005.   DOI
34 C. Veaux, J. Yamagishi, and S. King, "The voice bank corpus: Design, collection and data analysis of a large regional accent speech database," in Proc. of IEEE International Committee for the Co-ordination and Standardization of Speech Databases and Assessment Techniques, held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), pp. 1-4, 2013.
35 A. Varga and H. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Communication, vol. 12, pp. 247-251, 1993.   DOI
36 A. Rix, J. Beerends, M. Hollier, and A. Hekstra, "Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs," in Proc. of International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 749-752, 2001.
37 G. Kim, Y. Lu, Y. Hu, and P. C. Loizou, "An algorithm that improves speech intelligibility in noise for normal-hearing listeners," The Journal of the Acoustical Society of America, vol. 126, pp. 1486-1494, 2009.   DOI
38 H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," The Journal of the Acoustical Society of America, vol. 87, pp. 1738-1752, 1990.   DOI
39 H. Hermansky and N. Morgan, "RASTA processing of speech," IEEE Transactions on Speech, Audio Process, vol. 2, no. 4, pp. 578-589, Oct 1994.   DOI
40 C. Taal, R. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Transactions on Audio, Speech, Lang. Process., vol. 19, no. 7, pp. 2125-2136, Sep 2011.   DOI
41 I. Cohen and B. Berdugo, "Speech enhancement for nonstationary noise environments," Signal Processing, vol. 81, no. 11, pp. 2403-2418, 2001.   DOI
42 Albensano, D., Gemello, R., Laface, P., Mana, F., Scanzio, S., "Adaptation of artificial neural networks avoiding catastrophic forgetting," in Proc. of International Conference on Neural Networks (IJCNN), pp. 1554-1561, 2006.
43 Timo Gerkmann and Richard C. Hendriks, "Unbiased MMSE-based noise power estimation with low complexity and low tracking delay," IEEE Transactions on Audio, Speech and Language Processing, vol. 20, no. 4, pp. 1383-1393, 2012.   DOI
44 Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R., "Improving neural networks by preventing co-adaptation of feature detectors," arXiv preprint arXiv:1207.0580, 2012.
45 Miao Yajie, Metze Florian, "Improving low-resource CD-DNN-HMM using dropout and multilingual DNN training," in Proc. of the 14th Annual Conference of the International Speech Communication Association (INTERSPEECH), Lyon, France: ISCA, pp. 2237-2241, 2013.
46 Li, X., Bilmes, J., "Regularized adaptation of discriminative classifiers," in Proc. of International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. I-I, 2006.
47 Stadermann, J., Rigoll, G., "Two-stage speaker adaptation of hybrid tied-posterior acoustic models," in Proc. of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2005.
48 Zechao Li and Jinhui Tang, "Weakly-supervised deep matrix factorization for social image understanding," IEEE Transactions on Image Processing, pp. 1-13, 2016.
49 R. Talmon and S. Gannot, "Single-channel transient interference suppression with diffusion maps," IEEE Transactions on Audio, Speech, and Language Process, vol. 21, no. 1, pp. 132-144, 2013.   DOI
50 Scott Pennock, "Accuracy of the perceptual evaluation of speech quality (PESQ) algorithm," in Proc. of Measurement of Speech and Audio Quality in Networks Workshop (MESAQIN), 2002.
51 Zechao Li and Jinhui Tang, "Weakly supervised deep metric learning for community-contributed image retrieval," IEEE Transactions on Multimedia, pp. 1989-1999, 2015.