Triplet loss based domain adversarial training for robust wake-up word detection in noisy environments

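  • Hyungjun Lim (School of Electrical Engineering, Korea Advanced Institute of Science and Technology) ;
  • Myunghun Jung (School of Electrical Engineering, Korea Advanced Institute of Science and Technology) ;
  • Hoirin Kim (School of Electrical Engineering, Korea Advanced Institute of Science and Technology)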
  • Received : 2020.07.17
  • Accepted : 2020.08.29
  • Published : 2020.09.30

Abstract

An acoustic word embedding that expresses the characteristics of a word well plays an important role in wake-up word detection (WWD). However, the representation ability of an acoustic word embedding can be weakened by the various types of environmental noise that occur where WWD operates, which degrades detection performance. In this paper, we propose triplet loss based Domain Adversarial Training (tDAT), which mitigates environmental factors that can affect acoustic word embeddings. Through experiments in noisy environments, we verify that the proposed method effectively improves on the conventional DAT approach, and we confirm its extensibility by combining it with another method proposed for robust WWD.
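
The abstract does not give the tDAT formulation. For readers unfamiliar with the triplet loss that the method's name refers to, the following is a minimal sketch of the generic hinge-form triplet loss on embedding vectors. The function names, the Euclidean distance, the margin value, and the toy vectors are all assumptions made for illustration; how the paper selects triplets across noise domains, and how the loss enters adversarial training, is not stated here.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic hinge triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is farther from the anchor
    than the positive by at least `margin`.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2-D embeddings, chosen only so the distances are easy to check.
anchor = [0.0, 0.0]
near = [3.0, 4.0]   # distance 5 from the anchor
far = [6.0, 8.0]    # distance 10 from the anchor

# Positive is closer than negative by more than the margin: no penalty.
loss_satisfied = triplet_loss(anchor, near, far, margin=1.0)

# Roles swapped: the positive is farther than the negative, so the loss is positive.
loss_violated = triplet_loss(anchor, far, near, margin=1.0)
```

In this toy case the first triplet already satisfies the margin, while the second is penalized by the distance gap plus the margin.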
