Improving transformer-based acoustic model performance using sequence discriminative training

  • Chae-won Lee (Department of Electronic Engineering, Hanyang University)
  • Joon-Hyuk Chang (Department of Electronic Engineering, Hanyang University)
  • Received : 2022.03.21
  • Accepted : 2022.05.09
  • Published : 2022.05.31

Abstract

In this paper, we adopt the transformer, which has shown remarkable performance in natural language processing, as the acoustic model of a hybrid speech recognition system. The transformer acoustic model processes sequential data with attention structures and achieves high performance at low computational cost. This paper proposes a method to improve the performance of the transformer acoustic model (AM) by applying each of four sequence discriminative training algorithms, a weighted finite-state transducer (wFST)-based training approach used in existing DNN-HMM models. Compared with the Cross Entropy (CE) training method, sequence discriminative training achieves a relative Word Error Rate (WER) reduction of 5 %.
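As background for the terms used above, sequence discriminative training replaces the frame-level CE objective with a sentence-level criterion computed over wFST lattices. In the standard Maximum Mutual Information (MMI) formulation (e.g., Veselý et al., Interspeech 2013), the objective for acoustic model parameters θ is, written in LaTeX notation,

\mathcal{F}_{\mathrm{MMI}}(\theta) = \sum_{u} \log \frac{p_{\theta}(O_u \mid S_{W_u})^{\kappa}\, P(W_u)}{\sum_{W} p_{\theta}(O_u \mid S_{W})^{\kappa}\, P(W)},

where O_u is the acoustic frame sequence of utterance u, W_u its reference transcript, S_W the HMM state sequence of word sequence W, κ the acoustic scale, and the denominator sum runs over the competing hypotheses encoded in the lattice.

The Python (PyTorch) code below is a minimal sketch, not the authors' implementation: it shows a transformer encoder acoustic model producing per-frame senone logits and trained with the frame-level CE baseline mentioned in the abstract. FEAT_DIM, NUM_SENONES, the layer sizes, and all variable names are illustrative assumptions, and positional encoding is omitted for brevity.

import torch
import torch.nn as nn

# Illustrative constants (assumptions, not values from the paper)
FEAT_DIM = 80        # e.g. log-mel filterbank dimension
NUM_SENONES = 5000   # size of the tied triphone-state (senone) inventory

class TransformerAM(nn.Module):
    """Transformer encoder mapping acoustic frames to per-frame senone logits."""
    def __init__(self, d_model=512, nhead=8, num_layers=12):
        super().__init__()
        self.input_proj = nn.Linear(FEAT_DIM, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           dim_feedforward=2048, dropout=0.1,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.output_proj = nn.Linear(d_model, NUM_SENONES)

    def forward(self, feats):                 # feats: (batch, time, FEAT_DIM)
        x = self.input_proj(feats)
        x = self.encoder(x)                   # self-attention over the frame sequence
        return self.output_proj(x)            # (batch, time, NUM_SENONES)

model = TransformerAM()
criterion = nn.CrossEntropyLoss()

feats = torch.randn(4, 200, FEAT_DIM)               # dummy minibatch of features
targets = torch.randint(0, NUM_SENONES, (4, 200))   # frame-level senone alignments

logits = model(feats)
loss = criterion(logits.reshape(-1, NUM_SENONES), targets.reshape(-1))
loss.backward()

# Sequence discriminative training would replace this frame-level CE loss with a
# lattice-based objective (e.g. MMI, boosted MMI, sMBR, or LF-MMI) evaluated over
# wFST numerator/denominator graphs, typically via toolkits such as PyKaldi2 or PyChain.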

Acknowledgement

This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (Ministry of Science and ICT) in 2021 (No. 2021-0-00456, Development of speech quality enhancement technology for remote multi-party video conferencing).
