Compromised feature normalization method for deep neural network based speech recognition  

Kim, Min Sik (Department of Electronics Engineering, Pusan National University)
Kim, Hyung Soon (Department of Electronics Engineering, Pusan National University)
Publication Information
Phonetics and Speech Sciences / v.12, no.3, 2020, pp. 65-71
Feature normalization reduces the effect of environmental mismatch between training and test conditions by normalizing the statistical characteristics of acoustic feature parameters. It yields substantial performance improvements in traditional Gaussian mixture model-hidden Markov model (GMM-HMM)-based speech recognition systems. In a deep neural network (DNN)-based speech recognition system, however, minimizing the effect of environmental mismatch does not necessarily yield the best performance. In this paper, we attribute this phenomenon to information loss caused by excessive feature normalization. We investigate whether there is a feature normalization method that maximizes speech recognition performance by appropriately reducing the impact of environmental mismatch while preserving information useful for training acoustic models. To this end, we introduce mean and exponentiated variance normalization (MEVN), a compromise between mean normalization (MN) and mean and variance normalization (MVN), and compare the performance of a DNN-based speech recognition system in noisy and reverberant environments according to the degree of variance normalization. Experimental results show that MEVN achieves a slight performance improvement over MN and MVN, depending on the degree of variance normalization.
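The compromise described in the abstract can be sketched as follows: each feature dimension has its mean removed, and is divided by its standard deviation raised to an exponent that controls the degree of variance normalization. This is a minimal illustration assuming a per-utterance parameterization with a hypothetical exponent `alpha`, where `alpha = 0` reduces to MN and `alpha = 1` reduces to MVN; the paper's exact formulation may differ.

```python
import numpy as np

def mevn(features, alpha=0.5, eps=1e-8):
    """Mean and exponentiated variance normalization (MEVN) sketch.

    features : array of shape (frames, dims), e.g. MFCC vectors
    alpha    : degree of variance normalization (assumed parameter);
               alpha = 0 -> mean normalization (MN),
               alpha = 1 -> mean and variance normalization (MVN)
    eps      : floor to avoid division by zero for constant dimensions
    """
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)                 # per-dimension mean
    std = np.maximum(features.std(axis=0), eps)  # per-dimension std, floored
    # Divide by std**alpha: interpolates between MN and MVN.
    return (features - mean) / std ** alpha
```

With `alpha = 0.0` the output equals the mean-subtracted features, and with `alpha = 1.0` each dimension is standardized to zero mean and unit variance; intermediate values only partially equalize the variance, which is the trade-off the paper studies.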
Keywords: speech recognition; feature normalization; environmental mismatch; deep neural network