AI-based language tutoring systems with end-to-end automatic speech recognition and proficiency evaluation

  • Byung Ok Kang (Integrated Intelligence Research Section, Electronics and Telecommunications Research Institute) ;
  • Hyung-Bae Jeon (Integrated Intelligence Research Section, Electronics and Telecommunications Research Institute) ;
  • Yun Kyung Lee (Soundustry Inc.)
  • Received : 2023.08.10
  • Accepted : 2023.12.20
  • Published : 2024.02.20

Abstract

This paper presents the development of language tutoring systems for non-native speakers that leverage advanced end-to-end automatic speech recognition (ASR) and automatic proficiency evaluation. Because non-native speech contains frequent errors, high-performance spontaneous speech recognition is required. Our systems rely on precise transcriptions to accurately evaluate pronunciation and speaking fluency and to provide feedback on errors. End-to-end ASR is implemented and enhanced by using diverse non-native speech data for model training. To further improve performance, we combine semi-supervised and transfer learning techniques that exploit both labeled and unlabeled speech data. Automatic proficiency evaluation is performed by a model trained to maximize the statistical correlation between the fluency score assigned manually by a human expert and the automatically calculated fluency score. Using these components, we developed an English tutoring system for Korean elementary students, EBS AI Peng-Talk, and a Korean tutoring system for foreigners, KSI Korean AI Tutor. Both systems have been deployed by South Korean government agencies.
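
As a rough, illustrative sketch of the correlation-based evaluation idea described above (not the paper's actual model, feature set, or data), the following Python snippet fits a linear combination of hypothetical utterance-level fluency features to expert ratings. For a linear model, the least-squares fit maximizes the Pearson correlation between the automatic score and the human score, which is the kind of criterion the abstract refers to. All feature names and values below are invented placeholders.

    import numpy as np

    def pearson_r(a: np.ndarray, b: np.ndarray) -> float:
        # Pearson correlation between two score vectors.
        a, b = a - a.mean(), b - b.mean()
        return float((a @ b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Hypothetical per-utterance fluency features derived from ASR output
    # (e.g., speech rate, pause ratio, pronunciation score); placeholder data only.
    rng = np.random.default_rng(0)
    features = rng.normal(size=(200, 3))    # 200 utterances, 3 features
    expert_scores = rng.normal(size=200)    # expert fluency ratings (toy values)

    # Least-squares fit of a linear feature combination to the expert scores:
    # among all linear combinations, this maximizes the Pearson correlation
    # between the automatic score and the human score.
    X = np.hstack([features, np.ones((len(features), 1))])  # add a bias term
    weights, *_ = np.linalg.lstsq(X, expert_scores, rcond=None)
    auto_scores = X @ weights

    print(f"correlation with expert scores: {pearson_r(auto_scores, expert_scores):.3f}")

In practice, the evaluator would be trained on features extracted from the ASR transcriptions and audio rather than on synthetic values, but the correlation-maximizing objective is the same.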

Keywords

Funding

This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT), Grant/Award Number: 2019-0-00004.
