DOI: http://dx.doi.org/10.4218/etrij.2021-0106

Fast offline transformer-based end-to-end automatic speech recognition for real-world applications  

Oh, Yoo Rhee (Artificial Intelligence Research Laboratory, Electronics and Telecommunications Research Institute)
Park, Kiyoung (Artificial Intelligence Research Laboratory, Electronics and Telecommunications Research Institute)
Park, Jeon Gue (Artificial Intelligence Research Laboratory, Electronics and Telecommunications Research Institute)
Publication Information
ETRI Journal / v.44, no.3, 2022, pp. 476-490
Abstract
With recent advances in technology, automatic speech recognition (ASR) has been widely adopted in real-world applications, and the ability to convert large amounts of speech into text accurately with limited resources has become more vital than ever. In this study, we propose a method to rapidly recognize a large speech database via a transformer-based end-to-end model. Transformers have improved state-of-the-art performance in many fields, but they are difficult to apply to long sequences. We propose and test several techniques to accelerate the recognition of real-world speech: decoding via multiple-utterance-batched beam search, detecting the end of speech based on connectionist temporal classification (CTC), restricting the CTC-prefix score, and splitting long speech into short segments. Experiments on the LibriSpeech dataset and real-world Korean ASR tasks verify the proposed methods: the proposed system converts 8 h of speech recorded at real-world meetings into text in less than 3 min with a 10.73% character error rate, a 27.1% relative reduction compared with conventional systems.
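To make the CTC-based end-of-speech detection and long-speech splitting ideas concrete, the sketch below is a minimal illustration and not the authors' implementation. It assumes per-frame CTC blank posteriors are available from the encoder, and it treats a sufficiently long run of high-blank-probability frames as non-speech, both ending the current segment (end-of-speech detection) and splitting a long recording into short segments. The function name `find_segments` and the threshold values are hypothetical.

```python
# Minimal sketch (illustrative, not the paper's code) of splitting a long
# utterance into speech segments using CTC blank posteriors.
import numpy as np

def find_segments(blank_posterior: np.ndarray,
                  blank_threshold: float = 0.99,
                  min_blank_run: int = 30) -> list[tuple[int, int]]:
    """Return (start, end) frame indices of speech segments.

    blank_posterior: per-frame probability of the CTC blank label, shape (T,),
        e.g. taken from the encoder's CTC softmax output.
    A run of at least `min_blank_run` consecutive frames whose blank
    probability exceeds `blank_threshold` is treated as non-speech and
    closes the current segment (end-of-speech for that segment).
    """
    is_blank = blank_posterior > blank_threshold
    segments, start, blank_run = [], None, 0
    for t, blank in enumerate(is_blank):
        if not blank:
            if start is None:
                start = t          # first speech frame of a new segment
            blank_run = 0
        else:
            blank_run += 1
            # Blank run is long enough: close the current segment at the
            # frame where the blank run began.
            if start is not None and blank_run >= min_blank_run:
                segments.append((start, t - blank_run + 1))
                start = None
    if start is not None:          # trailing speech up to the last frame
        segments.append((start, len(is_blank)))
    return segments

# Toy example: two "speech" regions between stretches of blank frames.
blank = np.ones(300)
blank[50:120] = 0.1
blank[180:240] = 0.1
print(find_segments(blank))       # -> [(50, 120), (180, 240)]
```

In a batched offline system of the kind the abstract describes, each detected segment would then be decoded in parallel via the multiple-utterance-batched beam search, so the threshold values trade segmentation granularity against the risk of cutting through words.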
Keywords
connectionist temporal classification; end-to-end; speech recognition; transformer