Dynamically weighted loss based domain adversarial training for children's speech recognition

  • Seunghee Ma (R&D Center, LOTTE Data Communication Company)
  • Received: 2022.07.29
  • Accepted: 2022.09.30
  • Published: 2022.11.30

Abstract


Although children's speech recognition is being used in a growing number of fields, the lack of quality data is an obstacle to improving its performance. This paper proposes a new method that improves children's speech recognition performance by additionally using adult speech data. The proposed method applies transformer-based domain adversarial training with a dynamically weighted loss to effectively handle the age-group data imbalance that grows as the amount of adult training data increases. Specifically, the degree of class imbalance within each mini-batch is quantified during training, and the loss function is defined so that the class with fewer samples receives a greater weight. Experiments validate the utility of the proposed domain adversarial training under varying degrees of asymmetry between the adult and child training data. The results show that the proposed method outperforms conventional domain adversarial training in children's speech recognition under all conditions in which an age imbalance exists in the training data.
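
The abstract does not give the paper's exact weighting formula, so the following PyTorch sketch is only illustrative: it assumes inverse-frequency class weights recomputed from each mini-batch, combined with the standard gradient reversal layer used in domain adversarial training. The identifiers encoder, domain_classifier, asr_loss, and lambd below are hypothetical placeholders, not names from the paper.

    import torch
    import torch.nn.functional as F

    class GradientReversal(torch.autograd.Function):
        # Identity on the forward pass; negates (and scales) gradients on the
        # backward pass, as in standard domain adversarial training.
        @staticmethod
        def forward(ctx, x, lambd):
            ctx.lambd = lambd
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lambd * grad_output, None

    def dynamically_weighted_domain_loss(logits, domain_labels, num_domains=2):
        # Count samples per domain (e.g., 0 = adult, 1 = child) in this batch.
        counts = torch.bincount(domain_labels, minlength=num_domains).float()
        # Inverse-frequency weights: the under-represented domain gets the
        # larger weight; the clamp avoids division by zero for absent classes.
        weights = counts.sum() / (num_domains * counts.clamp(min=1.0))
        return F.cross_entropy(logits, domain_labels, weight=weights)

In a training step the domain classifier would then see gradient-reversed encoder features, e.g. loss = asr_loss + dynamically_weighted_domain_loss(domain_classifier(GradientReversal.apply(feats, lambd)), domain_labels), so the encoder is pushed toward age-invariant representations while the domain loss stays balanced within each mini-batch.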

Keywords
