Fast offline transformer-based end-to-end automatic speech recognition for real-world applications

  • Oh, Yoo Rhee (Artificial Intelligence Research Laboratory, Electronics and Telecommunications Research Institute) ;
  • Park, Kiyoung (Artificial Intelligence Research Laboratory, Electronics and Telecommunications Research Institute) ;
  • Park, Jeon Gue (Artificial Intelligence Research Laboratory, Electronics and Telecommunications Research Institute)
  • Received : 2021.03.26
  • Accepted : 2021.08.31
  • Published : 2022.06.10

Abstract

With recent advances in technology, automatic speech recognition (ASR) has been widely adopted in real-world applications, and the ability to convert large amounts of speech into text accurately with limited resources has become more vital than ever. In this study, we propose a method to rapidly recognize a large speech database with a transformer-based end-to-end model. Transformers have improved state-of-the-art performance in many fields; however, they are difficult to apply to long sequences. We therefore propose and test several techniques for accelerating the recognition of real-world speech, including decoding via multiple-utterance-batched beam search, detecting the end of speech based on connectionist temporal classification (CTC), restricting the CTC-prefix score, and splitting long speech into short segments. Experiments are conducted on the Librispeech dataset and on real-world Korean ASR tasks to verify the proposed methods. The proposed system converts 8 h of speech recorded at real-world meetings into text in less than 3 min with a character error rate of 10.73%, a relative reduction of 27.1% compared with conventional systems.
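As a rough illustration of two of these ideas, the short Python sketch below (not taken from the paper; the thresholds, frame counts, and function names are assumptions for illustration only) declares end of speech when the CTC blank posterior stays high for a fixed number of trailing frames, and splits a long recording in the middle of long blank runs.

import numpy as np

def detect_end_of_speech(blank_probs, threshold=0.99, min_trailing_frames=50):
    # Declare end of speech when the last `min_trailing_frames` frames are almost surely CTC blank.
    if len(blank_probs) < min_trailing_frames:
        return False
    return bool(np.all(blank_probs[-min_trailing_frames:] > threshold))

def split_on_blank_runs(blank_probs, threshold=0.99, min_run=100):
    # Cut in the middle of every blank run of at least `min_run` frames and
    # return the resulting (start, end) frame segments.
    cuts, run_start = [], None
    for t, p in enumerate(blank_probs):
        if p > threshold:
            if run_start is None:
                run_start = t
        else:
            if run_start is not None and t - run_start >= min_run:
                cuts.append((run_start + t) // 2)
            run_start = None
    if run_start is not None and len(blank_probs) - run_start >= min_run:
        cuts.append((run_start + len(blank_probs)) // 2)
    segments, prev = [], 0
    for c in cuts + [len(blank_probs)]:
        if c > prev:
            segments.append((prev, c))
            prev = c
    return segments

# Toy example: 8000 frames with two long pauses (blank posterior near 1).
probs = np.full(8000, 0.2)
probs[3000:3200] = 0.999
probs[6000:6300] = 0.999
print(split_on_blank_runs(probs))   # [(0, 3100), (3100, 6150), (6150, 8000)]
print(detect_end_of_speech(probs))  # False: the recording does not end in a long pause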

Keywords

Acknowledgement

This work was supported by an Institute for Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-01376, Development of the multi-speaker conversational speech recognition technology).
