Acoustic model training using self-attention for low-resource speech recognition

  • Received : 2020.08.07
  • Accepted : 2020.09.04
  • Published : 2020.09.30

Abstract

This paper proposes an acoustic model training method that uses self-attention for low-resource speech recognition. In low-resource speech recognition, the acoustic model has difficulty distinguishing certain phones, for example the plosives /d/ and /t/, the plosives /g/ and /k/, and the affricates /z/ and /ch/. During acoustic model training, the self-attention mechanism generates attention weights over the vectors output by the deep neural network, and in this study these weights are used to reduce similar-pronunciation errors in low-resource speech recognition. When the proposed method was applied to a Time Delay Neural Network-Output gate Projected Gated Recurrent Unit (TDNN-OPGRU)-based acoustic model, the proposed model achieved a word error rate of 5.98 %, an absolute improvement of 0.74 % over the baseline TDNN-OPGRU model.

This paper proposes an acoustic model training method to improve acoustic model performance in low-resource speech recognition. A low-resource environment here refers to one in which less than 100 hours of training data are available for the acoustic model. In low-resource speech recognition, the acoustic model fails to distinguish similar pronunciations well; for example, the plosives /d/ and /t/, the plosives /g/ and /k/, and the affricates /z/ and /ch/ are poorly distinguished in a low-resource environment. The self-attention mechanism assigns weights to the vectors output by the deep neural network model, and through these weights it addresses the similar-pronunciation errors that can arise in a low-resource environment. When the self-attention-based training method was applied to a hybrid model combining the Time Delay Neural Network (TDNN) and the Output gate Projected Gated Recurrent Unit (OPGRU), both of which perform well as acoustic models, a Korean acoustic model trained on 51.6 h of data achieved a word error rate of 5.98 %, an absolute improvement of 0.74 % over the existing technique.
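As an illustration of the weighting step described above, the following is a minimal NumPy sketch of scaled dot-product self-attention applied to the per-frame vectors produced by a neural acoustic model. The function name, the dimensions, and the randomly initialized projection matrices are illustrative assumptions only; in the proposed method the attention parameters are learned jointly with the TDNN-OPGRU acoustic model rather than drawn at random.

import numpy as np

def self_attention_weights(frames, d_k=64):
    # frames: (T, D) matrix of per-frame output vectors from the neural
    # acoustic model (one row per speech frame).
    T, D = frames.shape
    rng = np.random.default_rng(0)
    # Randomly initialized query/key/value projections, purely for illustration;
    # in a trained model these matrices are learned during acoustic model training.
    W_q = rng.standard_normal((D, d_k)) / np.sqrt(D)
    W_k = rng.standard_normal((D, d_k)) / np.sqrt(D)
    W_v = rng.standard_normal((D, d_k)) / np.sqrt(D)
    Q, K, V = frames @ W_q, frames @ W_k, frames @ W_v
    scores = Q @ K.T / np.sqrt(d_k)                # (T, T) frame-to-frame similarity
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability before softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax: attention weights per frame
    return weights @ V, weights                    # re-weighted frames, (T, T) weight matrix

# Example: 20 frames of 40-dimensional acoustic-model outputs.
frames = np.random.randn(20, 40)
context, attention = self_attention_weights(frames)
print(context.shape, attention.shape)              # (20, 64) (20, 20)

In this sketch each output frame becomes a weighted combination of all frames, so frames of acoustically similar phones can borrow context from neighboring frames, which is the intuition behind using the attention weights to reduce similar-pronunciation errors.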
