
A study on speech disentanglement framework based on adversarial learning for speaker recognition

  • Yoohwan Kwon (School of Electrical and Electronic Engineering, Yonsei University) ;
  • Soo-Whan Chung (School of Electrical and Electronic Engineering, Yonsei University) ;
  • Hong-Goo Kang (School of Electrical and Electronic Engineering, Yonsei University)
  • Received : 2020.07.31
  • Accepted : 2020.09.16
  • Published : 2020.09.30

Abstract

In this paper, we propose a system that extracts effective speaker representations from a speech signal using a deep learning method. Based on the fact that a speech signal contains identity-unrelated information such as text content, emotion, and background noise, we train the model so that the extracted features capture only speaker-related information and exclude speaker-unrelated information. Specifically, we propose an auto-encoder based disentanglement method that outputs both speaker-related and speaker-unrelated embeddings using effective loss functions. To further improve reconstruction performance in the decoding process, we also introduce a discriminator of the kind popularly used in the Generative Adversarial Network (GAN) framework. Since improving the decoding capability helps preserve speaker information and supports disentanglement, it leads to better speaker verification performance. Experimental results demonstrate the effectiveness of the proposed method through Equal Error Rate (EER) improvements on the benchmark dataset VoxCeleb1.

This paper proposes a system that extracts effective speaker vectors from a speech signal using deep learning techniques. Noting that a speech signal contains information unrelated to speaker identity, such as spoken content, emotion, and background noise, the proposed method trains the model so that the extracted speaker vector contains as much speaker-related information as possible while non-speaker information is minimized. In particular, we propose an effective speaker-information disentanglement method in which the encoder of an auto-encoder structure estimates two embedding vectors, and effective loss function constraints ensure that each embedding contains only speaker or non-speaker characteristics, respectively. In addition, by introducing the discriminator structure used in Generative Adversarial Networks (GAN), which helps preserve speaker information, we improve the decoder's performance and thereby further improve speaker recognition performance. The validity and effectiveness of the proposed method are demonstrated through experiments showing Equal Error Rate (EER) improvement on VoxCeleb1, a widely used benchmark dataset.
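To make the described architecture concrete, the following is a minimal PyTorch sketch of an auto-encoder whose encoder outputs separate speaker-related and speaker-unrelated embeddings, a decoder that reconstructs the input from both, and a GAN-style discriminator that judges the reconstruction. All module names, layer sizes, and the use of 1-D convolutions are illustrative assumptions, not the authors' exact configuration; the loss terms that tie the parts together are only indicated in comments.

```python
# Minimal sketch of the disentanglement framework described above (assumed shapes and sizes).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps an input feature sequence to a speaker embedding and a residual (non-speaker) embedding."""
    def __init__(self, feat_dim=40, emb_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv1d(feat_dim, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                      # temporal average pooling
        )
        self.spk_head = nn.Linear(256, emb_dim)           # speaker-related embedding
        self.res_head = nn.Linear(256, emb_dim)           # speaker-unrelated embedding

    def forward(self, x):                                 # x: (batch, feat_dim, time)
        h = self.backbone(x).squeeze(-1)                  # (batch, 256)
        return self.spk_head(h), self.res_head(h)

class Decoder(nn.Module):
    """Reconstructs the input features from the concatenated embeddings."""
    def __init__(self, feat_dim=40, emb_dim=128, time_steps=200):
        super().__init__()
        self.feat_dim, self.time_steps = feat_dim, time_steps
        self.net = nn.Sequential(
            nn.Linear(2 * emb_dim, 512), nn.ReLU(),
            nn.Linear(512, feat_dim * time_steps),
        )

    def forward(self, spk_emb, res_emb):
        out = self.net(torch.cat([spk_emb, res_emb], dim=-1))
        return out.view(-1, self.feat_dim, self.time_steps)

class Discriminator(nn.Module):
    """GAN-style discriminator judging real vs. reconstructed feature sequences."""
    def __init__(self, feat_dim=40):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, 128, kernel_size=5, stride=2), nn.LeakyReLU(0.2),
            nn.Conv1d(128, 128, kernel_size=5, stride=2), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(128, 1),
        )

    def forward(self, x):
        return self.net(x)                                # raw logits

# Illustrative forward pass; the full objective would combine a speaker classification
# loss on spk_emb, an adversarial loss removing speaker identity from res_emb,
# a reconstruction loss on x_hat, and the discriminator (GAN) loss.
enc, dec, disc = Encoder(), Decoder(), Discriminator()
x = torch.randn(8, 40, 200)                              # batch of 8 utterances, assumed feature shape
spk_emb, res_emb = enc(x)
x_hat = dec(spk_emb, res_emb)
real_logit, fake_logit = disc(x), disc(x_hat)
```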
