
Semi-supervised learning of speech recognizers based on variational autoencoder and unsupervised data augmentation


  • 조현호 (Department of Intelligent Robot Engineering, Chungbuk National University) ;
  • 강병옥 (Artificial Intelligence Research Laboratory, Electronics and Telecommunications Research Institute) ;
  • 권오욱 (Department of Intelligent Robot Engineering, Chungbuk National University)
  • Received : 2021.09.28
  • Accepted : 2021.11.04
  • Published : 2021.11.30

Abstract

We propose a semi-supervised learning method based on a Variational AutoEncoder (VAE) and Unsupervised Data Augmentation (UDA) to improve the performance of an end-to-end speech recognizer. In the proposed method, the VAE-based augmentation model and the baseline end-to-end speech recognizer are first trained on the original speech data. The baseline end-to-end speech recognizer is then retrained on data augmented by the trained augmentation model. Finally, the augmentation model and the end-to-end speech recognizer are retrained with the UDA-based semi-supervised learning method. Computer simulation results show that the augmentation model reduces the Word Error Rate (WER) of the baseline end-to-end speech recognizer, and that combining it with the UDA-based semi-supervised learning method yields a further improvement.
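
The abstract outlines a three-stage procedure: train a VAE-based augmentation model and a baseline end-to-end recognizer on the original (labeled) speech, retrain the recognizer on VAE-augmented data, and finally apply UDA-style consistency training that also uses unlabeled speech. The sketch below illustrates only the core idea of that final stage under assumed details; FeatureVAE, TinyASR, uda_step, the KL-divergence consistency term, and the weight lam are hypothetical placeholders, not the authors' actual models or hyperparameters.

```python
# Minimal PyTorch sketch (assumed, not the paper's implementation) of a
# UDA-style consistency step that uses a VAE as the data augmenter.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureVAE(nn.Module):
    """Toy frame-level VAE used only to perturb acoustic feature vectors."""

    def __init__(self, feat_dim=80, latent_dim=16):
        super().__init__()
        self.enc = nn.Linear(feat_dim, 2 * latent_dim)  # -> mean and log-variance
        self.dec = nn.Linear(latent_dim, feat_dim)

    def augment(self, x):  # x: (batch, time, feat_dim)
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(z)  # stochastically perturbed features


class TinyASR(nn.Module):
    """Stand-in frame classifier; the paper uses a full end-to-end recognizer."""

    def __init__(self, feat_dim=80, vocab_size=30):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, 128, batch_first=True)
        self.out = nn.Linear(128, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(x)
        return self.out(h)  # (batch, time, vocab_size) logits


def uda_step(asr, vae, labeled_batch, unlabeled_x, sup_criterion, lam=1.0):
    """Supervised loss on labeled speech plus a consistency loss between
    predictions on original and VAE-augmented unlabeled speech."""
    x_lab, y_lab = labeled_batch
    sup_loss = sup_criterion(asr(x_lab), y_lab)

    with torch.no_grad():  # target distribution comes from the un-augmented input
        p_orig = F.softmax(asr(unlabeled_x), dim=-1)
    log_p_aug = F.log_softmax(asr(vae.augment(unlabeled_x)), dim=-1)
    consistency = F.kl_div(log_p_aug, p_orig, reduction="batchmean")

    return sup_loss + lam * consistency


if __name__ == "__main__":
    asr, vae = TinyASR(), FeatureVAE()
    x_lab = torch.randn(4, 100, 80)         # labeled utterances (toy features)
    y_lab = torch.randint(0, 30, (4, 100))  # toy frame-level labels
    x_unlab = torch.randn(4, 100, 80)       # unlabeled utterances

    def frame_ce(logits, y):                # simple frame-wise supervised loss
        return F.cross_entropy(logits.transpose(1, 2), y)

    loss = uda_step(asr, vae, (x_lab, y_lab), x_unlab, frame_ce)
    loss.backward()
```

As in UDA generally, the target distribution is computed on the un-augmented input and held fixed (no gradient), so the consistency term pushes predictions on augmented speech toward those on clean speech, while the supervised term is computed only on labeled data.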


Keywords

Acknowledgement

This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (Ministry of Science and ICT) in 2019 (2019-0-00004, Development of semi-supervised learning language intelligence technology and a Korean tutoring service for foreigners based on it).
