http://dx.doi.org/10.13064/KSSS.2019.11.1.041

Visual analysis of attention-based end-to-end speech recognition  

Lim, Seongmin (School of Electrical Engineering, KAIST)
Goo, Jahyun (School of Electrical Engineering, KAIST)
Kim, Hoirin (School of Electrical Engineering, KAIST)
Publication Information
Phonetics and Speech Sciences, v.11, no.1, 2019, pp. 41-49
Abstract
An end-to-end speech recognition model, consisting of a single integrated neural network, was recently proposed. The end-to-end model does not require multiple separate training stages, and its structure is easy to understand. However, it is difficult to understand how the model recognizes speech internally. In this paper, we visualized and analyzed the attention-based end-to-end model to elucidate its internal mechanisms. We compared the acoustic model of the BLSTM-HMM hybrid model with the encoder of the end-to-end model, and visualized them using t-SNE to examine the differences between neural network layers. As a result, we were able to delineate the difference between the acoustic model and the encoder of the end-to-end model. Additionally, we analyzed the decoder of the end-to-end model from a language model perspective. Finally, we found that improving the decoder of the end-to-end model is necessary to achieve higher performance.
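The layer-wise comparison described in the abstract can be illustrated with a short sketch. The snippet below is not from the paper; it is a minimal example, assuming frame-level hidden activations and their aligned class labels (hypothetical `hidden_states` and `frame_labels` arrays) have already been extracted from either the hybrid acoustic model or the end-to-end encoder, showing how such activations could be projected with scikit-learn's t-SNE and plotted per class.

```python
# Minimal sketch (not the authors' code) of t-SNE visualization of per-frame
# layer activations, colored by an aligned phone/state label.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_layer_tsne(hidden_states, frame_labels, title="Layer activations (t-SNE)"):
    """Project frame-level activations to 2-D and scatter-plot them by label."""
    embedded = TSNE(n_components=2, perplexity=30, init="pca",
                    random_state=0).fit_transform(hidden_states)
    for label in np.unique(frame_labels):
        idx = frame_labels == label
        plt.scatter(embedded[idx, 0], embedded[idx, 1], s=3, label=str(label))
    plt.title(title)
    plt.legend(markerscale=3, fontsize="small")
    plt.show()

if __name__ == "__main__":
    # Random stand-in data; replace with real acoustic-model or encoder outputs.
    rng = np.random.default_rng(0)
    demo_states = rng.normal(size=(500, 320))    # e.g., one BLSTM layer's frame outputs
    demo_labels = rng.integers(0, 5, size=500)   # e.g., phone class per frame
    plot_layer_tsne(demo_states, demo_labels)
```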
Keywords
speech recognition; end-to-end; t-SNE; sequence-to-sequence;