[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.7236/IJIBC.2020.12.3.220

Non-Intrusive Speech Intelligibility Estimation Using Autoencoder Features with Background Noise Information

Jeong, Yue Ri (Dept. of Electronic and IT Media Engineering, Seoul National University of Science and Technology)
Choi, Seung Ho (Dept. of Electronic and IT Media Engineering, Seoul National University of Science and Technology)

Publication Information

International Journal of Internet, Broadcasting and Communication / v.12, no.3, 2020, pp. 220-225

Abstract

This paper investigates non-intrusive speech intelligibility estimation in noisy environments when the bottleneck feature of an autoencoder is used as the input to a neural network. The bottleneck feature-based method suffers severe performance degradation when the noise environment changes. To overcome this problem, we propose a novel non-intrusive speech intelligibility estimation method that adds noise environment information, along with the bottleneck feature, to the input of a long short-term memory (LSTM) neural network. The network's output is a short-time objective intelligibility (STOI) score, a standard intrusive measure of speech intelligibility that requires a reference speech signal. In experiments in various noise environments, the proposed method showed improved performance when the noise environment was the same. In particular, performance improved significantly over that of the conventional methods when the noise environments differed. We therefore conclude that the proposed method can be successfully used for non-intrusive speech intelligibility estimation in various noise environments.
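The input construction described in the abstract can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation. The noise labels, the one-hot encoding of the noise environment, the feature dimensions, the single-layer LSTM, and the utterance-level sigmoid read-out are all assumptions made for illustration, and the weights are random rather than trained, so the printed score has no meaning beyond showing the data flow.

```python
import math
import random

# Hypothetical noise-environment labels; the abstract does not list them.
NOISE_TYPES = ["babble", "car", "street"]


def one_hot(noise_type):
    """Encode the noise environment as a one-hot vector (an assumed encoding)."""
    return [1.0 if name == noise_type else 0.0 for name in NOISE_TYPES]


def build_inputs(bottleneck_frames, noise_type):
    """Append the noise-environment vector to every bottleneck feature frame."""
    env = one_hot(noise_type)
    return [list(frame) + env for frame in bottleneck_frames]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class TinyLSTMRegressor:
    """Single-layer LSTM with a sigmoid read-out; randomly initialised, untrained."""

    GATES = ("input", "forget", "candidate", "output")

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        cols = input_dim + hidden_dim  # each gate sees [x_t, h_{t-1}]
        self.hidden_dim = hidden_dim
        self.weights = {
            gate: [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                   for _ in range(hidden_dim)]
            for gate in self.GATES
        }
        self.biases = {gate: [0.0] * hidden_dim for gate in self.GATES}
        self.w_out = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]

    def predict(self, frames):
        h = [0.0] * self.hidden_dim  # hidden state
        c = [0.0] * self.hidden_dim  # cell state
        for x in frames:
            z = list(x) + h
            pre = {
                gate: [a + b for a, b in zip(matvec(self.weights[gate], z),
                                             self.biases[gate])]
                for gate in self.GATES
            }
            i = [sigmoid(v) for v in pre["input"]]
            f = [sigmoid(v) for v in pre["forget"]]
            g = [math.tanh(v) for v in pre["candidate"]]
            o = [sigmoid(v) for v in pre["output"]]
            c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
            h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
        # Map the last hidden state to (0, 1); STOI scores typically lie in [0, 1].
        return sigmoid(sum(w * hv for w, hv in zip(self.w_out, h)))


# Five frames of 4-dimensional bottleneck features (placeholder values).
frames = [[0.2, -0.1, 0.4, 0.0] for _ in range(5)]
inputs = build_inputs(frames, "car")
model = TinyLSTMRegressor(input_dim=len(inputs[0]), hidden_dim=8)
score = model.predict(inputs)
print(len(inputs[0]), score)
```

In this sketch every frame carries the same environment vector, so the network can condition its estimate on the noise type at each time step. A trained system would fit the weights by regressing against STOI scores computed from clean reference speech.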

Keywords

Non-intrusive; Speech intelligibility estimation; Noise environment; Autoencoder; Bottleneck feature; Long short-term memory (LSTM); STOI
