1. Chan, W., Jaitly, N., Le, Q., & Vinyals, O. (2016, March). Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4960-4964). Shanghai, China.
2. Tjandra, A., Sakti, S., & Nakamura, S. (2017, December). Listening while speaking: Speech chain by deep learning. Proceedings of the 2017 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) (pp. 301-308). Okinawa, Japan.
3. Vesely, K., Hannemann, M., & Burget, L. (2013, December). Semi-supervised training of deep neural networks. Proceedings of the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) (pp. 267-272). Olomouc, Czech Republic.
4. Watanabe, S., Hori, T., Karita, S., Hayashi, T., Nishitoba, J., Unno, Y., Soplin, N. E. Y., ... Ochiai, T. (2018). ESPnet: End-to-end speech processing toolkit [Computing research repository]. Retrieved from http://arxiv.org/abs/1804.00015
5. Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006, June). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. Proceedings of the 23rd International Conference on Machine Learning (pp. 369-376). Pittsburgh, PA.
6. Graves, A., Mohamed, A. R., & Hinton, G. (2013, May). Speech recognition with deep recurrent neural networks. Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6645-6649). Vancouver, Canada.
7. Gulcehre, C., Firat, O., Xu, K., Cho, K., Barrault, L., Lin, H., Bougares, F., ... Bengio, Y. (2015, June). On using monolingual corpora in neural machine translation [Computing research repository]. Retrieved from https://arxiv.org/pdf/1503.03535.pdf
8. Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A. R., Jaitly, N., Senior, A., ... Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6), 82-97.
9. Karita, S., Watanabe, S., Iwata, T., Ogawa, A., & Delcroix, M. (2018, September). Semi-supervised end-to-end speech recognition. Proceedings of the 19th Annual Conference of the International Speech Communication Association (INTERSPEECH) (pp. 2-6). Hyderabad, India.
10. Miao, Y., Gowayyed, M., & Metze, F. (2015, October). EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. Proceedings of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) (pp. 167-174). Scottsdale, AZ.
11. Mikolov, T., Karafiát, M., Burget, L., Černocký, J., & Khudanpur, S. (2010, September). Recurrent neural network based language model. Proceedings of the 11th Annual Conference of the International Speech Communication Association (INTERSPEECH) (pp. 1045-1048). Makuhari, Japan.