1. N. Moritz, T. Hori, and J. Le Roux, Streaming automatic speech recognition with the transformer model, in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (Barcelona, Spain), May 2020, pp. 6074-6078.
2. Y. Fujita et al., Insertion-based modeling for end-to-end automatic speech recognition, in Proc. Conf. Int. Speech Commun. Assoc. (Shanghai, China), Oct. 2020, pp. 3660-3664.
3. K. Park, A robust endpoint detection algorithm for the speech recognition in noisy environments, in Proc. Congr. Expos. Noise Control Eng. (Inter-Noise) (Innsbruck, Austria), Sept. 2013, pp. 5790-5795.
4. S. Watanabe et al., ESPnet: End-to-end speech processing toolkit, in Proc. Conf. Int. Speech Commun. Assoc. (Hyderabad, India), June 2018, pp. 2207-2211.
5. V. Roger, J. Farinas, and J. Pinquier, Deep neural networks for automatic speech processing: A survey from large corpora to limited data, arXiv preprint, CoRR, 2020, arXiv: 2003.04241.
6. H. Chung, J. G. Park, and H. Jung, Rank-weighted reconstruction feature for a robust deep neural network-based acoustic model, ETRI J. 41 (2019), no. 2, 235-241.
7. A. B. Nassif et al., Speech recognition using deep neural networks: A systematic review, IEEE Access 7 (2019), 19143-19165.
8. J. Li et al., On the comparison of popular end-to-end models for large scale speech recognition, in Proc. Conf. Int. Speech Commun. Assoc. (Shanghai, China), Oct. 2020, pp. 1-5.
9. D. Bahdanau, K. Cho, and Y. Bengio, Neural machine translation by jointly learning to align and translate, in Proc. Int. Conf. Learn. Represent. (San Diego, CA, USA), May 2015.
10. L. Dong, S. Xu, and B. Xu, Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition, in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (Calgary, Canada), Apr. 2018, pp. 5884-5888.
11. H. Miao et al., Transformer-based online CTC/attention end-to-end speech recognition architecture, in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (Barcelona, Spain), May 2020, pp. 6084-6088.
12. M. Ott et al., Scaling neural machine translation, in Proc. Conf. Mach. Translation (Brussels, Belgium), Oct. 2018, pp. 1-9.
13. J. U. Bang et al., Automatic construction of a large-scale speech recognition database using multi-genre broadcast data with inaccurate subtitle timestamps, IEICE Trans. Inf. Syst. 103 (2020), no. 2, 406-415.
14. A. Vaswani et al., Attention is all you need, in Proc. Int. Conf. Neural Inf. Process. Syst. (Long Beach, CA, USA), Dec. 2017, pp. 5998-6008.
15. N. Moritz, T. Hori, and J. Le Roux, Streaming end-to-end speech recognition with joint CTC-attention based models, in Proc. IEEE Workshop Automat. Speech Recognit. Underst. (Singapore, Singapore), Dec. 2019, pp. 936-943.
16. S. Zhou et al., Syllable-based sequence-to-sequence speech recognition with the Transformer in Mandarin Chinese, in Proc. Conf. Int. Speech Commun. Assoc. (Hyderabad, India), June 2018, pp. 791-795.
17. S. Zhou, S. Xu, and B. Xu, Multilingual end-to-end speech recognition with a single transformer on low-resource languages, arXiv preprint, CoRR, 2018, arXiv: 1806.05059.
18. X. Chang et al., End-to-end multi-speaker speech recognition with transformer, in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (Barcelona, Spain), May 2020, pp. 6134-6138.
19. W. Huang et al., Conv-transformer transducer: Low latency, low frame rate, streamable end-to-end speech recognition, in Proc. Conf. Int. Speech Commun. Assoc. (Shanghai, China), Oct. 2020, pp. 5001-5005.
20. G. I. Winata et al., Adapt-and-adjust: Overcoming the long-tail problem of multilingual speech recognition, arXiv preprint, CoRR, 2020, arXiv: 2012.01687.
21. S. Li et al., Improving transformer-based speech recognition with unsupervised pre-training and multi-task semantic knowledge learning, in Proc. Conf. Int. Speech Commun. Assoc. (Shanghai, China), Oct. 2020, pp. 5006-5010.
22. T. Hori et al., Transformer-based long-context end-to-end speech recognition, in Proc. Conf. Int. Speech Commun. Assoc. (Shanghai, China), Oct. 2020, pp. 5011-5015.
23. X. Zhou et al., Self-and-mixed attention decoder with deep acoustic structure for transformer-based LVCSR, in Proc. Conf. Int. Speech Commun. Assoc. (Shanghai, China), Oct. 2020, pp. 5016-5020.
24. T. Parcollet, M. Morchid, and G. Linares, E2E-SINCNET: Toward fully end-to-end speech recognition, in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (Barcelona, Spain), May 2020, pp. 7714-7718.
25. Y. Higuchi et al., Improved mask-CTC for non-autoregressive end-to-end ASR, arXiv preprint, CoRR, 2020, arXiv: 2010.13270.
26. Y. Higuchi et al., Mask CTC: Non-autoregressive end-to-end ASR with CTC and mask predict, in Proc. Conf. Int. Speech Commun. Assoc. (Shanghai, China), Oct. 2020, pp. 3655-3659.
27. Y. Lu et al., Bi-encoder transformer network for Mandarin-English code-switching speech recognition using mixture of experts, in Proc. Conf. Int. Speech Commun. Assoc. (Shanghai, China), Oct. 2020, pp. 4766-4770.
28. T. Hori, S. Watanabe, and J. Hershey, Joint CTC/attention decoding for end-to-end speech recognition, in Proc. Annu. Meet. Assoc. Comput. Linguistics (Vancouver, Canada), July 2017, pp. 518-529.
29. H. Miao et al., Online hybrid CTC/attention end-to-end automatic speech recognition architecture, IEEE/ACM Trans. Audio, Speech, Language Process. 28 (2020), 1452-1465.
30. T. Yoshimura et al., End-to-end automatic speech recognition integrated with CTC-based voice activity detection, in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (Barcelona, Spain), May 2020, pp. 6999-7003.
31. N. Kitaev, L. Kaiser, and A. Levskaya, Reformer: The efficient transformer, in Proc. Int. Conf. Learn. Represent. (Addis Ababa, Ethiopia), Apr. 2020.
32. C. Meister, T. Vieira, and R. Cotterell, Best-first beam search, Trans. Assoc. Comput. Linguistics 8 (2020), 795-809.
33. S. Watanabe et al., Hybrid CTC/attention architecture for end-to-end speech recognition, IEEE J. Sel. Top. Signal Process. 11 (2017), no. 8, 1240-1253.
34. P. Zhou et al., Improving generalization of transformer for speech recognition with parallel schedule sampling and relative positional embedding, arXiv preprint, CoRR, 2019, arXiv: 1911.00203.
35. T. Xiao et al., Sharing attention weights for fast transformer, in Proc. Int. Joint Conf. Artif. Intell. (Macao, China), Aug. 2019, pp. 5292-5298.
36. H. Seki, T. Hori, and S. Watanabe, Vectorization of hypotheses and speech for faster beam search in encoder decoder-based speech recognition, arXiv preprint, CoRR, 2018, arXiv: 1811.04568.
37. D. Amodei et al., Deep speech 2: End-to-end speech recognition in English and Mandarin, in Proc. Int. Conf. Mach. Learn. (New York, NY, USA), June 2016, pp. 173-182.
38. H. Braun et al., GPU-accelerated Viterbi exact lattice decoder for batched online and offline speech recognition, in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (Barcelona, Spain), May 2020, pp. 7874-7878.
39. Y. R. Oh, K. Park, and J. G. Park, Online speech recognition using multichannel parallel acoustic score computation and deep neural network (DNN)-based voice-activity detector, Appl. Sci. 10 (2020), no. 12, art. no. 4091.
40. A. Graves et al., Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks, in Proc. Int. Conf. Mach. Learn. (Pittsburgh, PA, USA), June 2006, pp. 369-376.
41. L. Kürzinger et al., Lightweight end-to-end speech recognition from raw audio data using Sinc-convolutions, in Proc. Conf. Int. Speech Commun. Assoc. (Shanghai, China), Oct. 2020, pp. 1659-1663.
42. I. Sutskever, O. Vinyals, and Q. V. Le, Sequence to sequence learning with neural networks, in Proc. Int. Conf. Neural Inf. Process. Syst. (Montreal, Canada), Dec. 2014.
43. H. Hwang and C. Lee, Linear-time Korean morphological analysis using an action-based local monotonic attention mechanism, ETRI J. 42 (2020), no. 1, 101-107.
44. S. H. K. Parthasarathi and N. Strom, Lessons from building acoustic models with a million hours of speech, in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (Brighton, UK), May 2019, pp. 6670-6674.
45. A. Gulati et al., Conformer: Convolution-augmented transformer for speech recognition, in Proc. Conf. Int. Speech Commun. Assoc. (Shanghai, China), Oct. 2020, pp. 5036-5040.
46. S. Karita et al., Improving Transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration, in Proc. Conf. Int. Speech Commun. Assoc. (Graz, Austria), Sept. 2019, pp. 1408-1412.
47. T. Moriya et al., Self-distillation for improving CTC-transformer-based ASR systems, in Proc. Conf. Int. Speech Commun. Assoc. (Shanghai, China), Oct. 2020, pp. 546-550.
48. H. Seki et al., Vectorized beam search for CTC-attention-based speech recognition, in Proc. Conf. Int. Speech Commun. Assoc. (Graz, Austria), Sept. 2019, pp. 3825-3829.
49. Y. Zhao et al., Cross attention with monotonic alignment for speech Transformer, in Proc. Conf. Int. Speech Commun. Assoc. (Shanghai, China), Oct. 2020, pp. 5031-5035.
50. S. Karita et al., A comparative study on transformer vs RNN in speech applications, in Proc. IEEE Workshop Automat. Speech Recognit. Underst. (Singapore, Singapore), Dec. 2019, pp. 449-456.