
Double-attention mechanism of sequence-to-sequence deep neural networks for automatic speech recognition

  • Yook, Dongsuk (Artificial Intelligence Laboratory, Department of Computer Science and Engineering, Korea University)
  • Lim, Dan (Kakao Corp.)
  • Yoo, In-Chul (Artificial Intelligence Laboratory, Department of Computer Science and Engineering, Korea University)
  • Received: 2020.07.31
  • Accepted: 2020.09.14
  • Published: 2020.09.30

Abstract


Sequence-to-sequence deep neural networks with attention mechanisms have shown superior performance in various domains where the lengths of the input and output sequences differ. However, when the input sequence is much longer than the output sequence, and the characteristics of the input corresponding to a single output token vary over time, a conventional attention mechanism may be inappropriate, because it summarizes the input with only a single context vector per output token. In this paper, we propose a double-attention mechanism that handles this problem by using two context vectors that separately cover the left and right parts of the attention focus in the input. The effectiveness of the proposed method is evaluated through speech recognition experiments on the TIMIT corpus.
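
To make the idea concrete, the sketch below shows one way such a double-attention step could look in PyTorch: the attention energies are split at a focus point, and two context vectors are computed over the left and right parts of the input before being concatenated for the decoder. The bilinear scoring function, the externally supplied focus index, and all identifiers here are assumptions made for illustration; this is not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def double_attention(enc, query, focus, w_att):
        """One decoding step with two context vectors (illustrative sketch).

        enc:   (T, d) encoder outputs for one utterance
        query: (d,)   decoder state for the current output token
        focus: int    assumed split point inside the attended region, 0 < focus < T
        w_att: (d, d) learned projection for bilinear attention scoring

        Returns a (2d,) vector: the left context concatenated with the
        right context, so the decoder sees both sides of the focus.
        """
        assert 0 < focus < enc.size(0), "focus must split the input into two parts"
        scores = enc @ (w_att @ query)                             # (T,) attention energies
        ctx_left = F.softmax(scores[:focus], dim=0) @ enc[:focus]  # (d,) left context
        ctx_right = F.softmax(scores[focus:], dim=0) @ enc[focus:] # (d,) right context
        return torch.cat([ctx_left, ctx_right])                    # (2d,) double context

For example, with 100 encoder frames of dimension 8, the call below returns a 16-dimensional double context vector:

    enc = torch.randn(100, 8)       # 100 frames, 8-dim features
    ctx = double_attention(enc, torch.randn(8), focus=50, w_att=torch.randn(8, 8))
    print(ctx.shape)                # torch.Size([16])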

