
Speaker verification system combining attention-long short term memory based speaker embedding and I-vector in far-field and noisy environments

  • Bae, Ara (Department of Computer Science and Engineering, Incheon National University) ;
  • Kim, Wooil (Department of Computer Science and Engineering, Incheon National University)
  • Received : 2020.01.21
  • Accepted : 2020.03.20
  • Published : 2020.03.31

Abstract


Many studies based on I-vector features have been conducted in a variety of environments, from text-dependent short utterances to text-independent long utterances. In this paper, we propose a speaker verification system for far-field and noisy environments that combines an I-vector with Probabilistic Linear Discriminant Analysis (PLDA) and a speaker embedding extracted from a Long Short-Term Memory (LSTM) network with an attention mechanism. The Equal Error Rate (EER) of the LSTM model is 15.52 % and that of the Attention-LSTM model is 8.46 %, an improvement of 7.06 %. This shows that the proposed method resolves the problem of conventional extraction methods, which define the embedding heuristically. Before combination, the I-vector with PLDA shows the best performance with an EER of 6.18 %. When combined with the Attention-LSTM based embedding, the EER drops to 2.57 %, a 3.61 % absolute reduction and a 58.41 % relative improvement over the baseline.
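To illustrate the core idea of the Attention-LSTM embedding, the following is a minimal numpy sketch: frame-level LSTM outputs are collapsed into a single utterance-level embedding by a softmax-weighted (attention) average rather than a fixed heuristic such as last-frame or mean pooling, and two embeddings are then compared by cosine similarity. The attention parameter `w`, the `fuse` function, and its weight `lam` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(frames, w):
    """Collapse frame-level outputs (T, D) into one utterance-level
    embedding (D,) via attention: each frame gets a learned relevance
    score, and the embedding is the softmax-weighted average."""
    scores = frames @ w          # (T,) one relevance score per frame
    alpha = softmax(scores)      # attention weights, sum to 1
    return alpha @ frames        # weighted average over time

def cosine(a, b):
    """Cosine similarity between two embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def fuse(plda_score, cos_score, lam=0.5):
    """Hypothetical linear score fusion of the I-vector/PLDA score and
    the embedding cosine score; the paper's combination rule may differ."""
    return lam * plda_score + (1.0 - lam) * cos_score

# Toy example: random "LSTM outputs" for one utterance.
rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 8))   # T=50 frames, D=8 hidden units
w = rng.normal(size=8)              # stand-in for learned attention params
emb = attention_pool(frames, w)
assert emb.shape == (8,)
```

In a trained system `w` (and usually a small projection layer) is learned jointly with the LSTM, so the network itself decides which frames are speaker-discriminative instead of relying on a hand-picked pooling rule.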


References

  1. P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, "Joint factor analysis versus eigenchannels in speaker recognition," IEEE Trans. Audio, Speech, and Language Processing, 15, 2072-2084 (2007). https://doi.org/10.1109/TASL.2007.902870
  2. D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted gaussian mixture models," Digital Signal Processing, 10, 19-41 (2000). https://doi.org/10.1006/dspr.1999.0361
  3. N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Trans. Audio, Speech, and Language Processing, 19, 788-798 (2011). https://doi.org/10.1109/TASL.2010.2064307
  4. E. Variani, X. Lei, E. McDermott, I. Lopez-Moreno, and J. Gonzalez Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," Proc. ICASSP. 4080-4084 (2014).
  5. V. Peddinti, D. Povey, and S. Khudanpur, "A time delay neural network architecture for efficient modeling of long temporal contexts," Proc. Interspeech, 3214-3218 (2015).
  6. Y. Liu, Y. Qian, N. Chen, T. Fu, Y. Zhang, and K. Yu, "Deep feature for text-dependent speaker verification," Speech Communication, 73, 1-13 (2015). https://doi.org/10.1016/j.specom.2015.07.003
  7. D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, "Deep neural network embeddings for text-independent speaker verification," Proc. Interspeech, 999-1003 (2017).
  8. G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, "End-to-end text-dependent speaker verification," Proc. IEEE ICASSP. 5115-5119 (2016).
  9. D. Bahdanau, K. Cho, and Y. Bengio. "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473 (2014).
  10. S. J. D. Prince and J. H. Elder, "Probabilistic linear discriminant analysis for inferences about identity," Proc. IEEE 11th ICCV. 1-8 (2007).
  11. B. Fauve, N. Evans, and J. Mason, "Improving the performance of text-independent short duration SVM and GMM based speaker verification," Proc. Odyssey, Stellenbosch, 18 (2008).
  12. F. Chowdhury, Q. Wang, I. L. Moreno, and L. Wan, "Attention-based models for text-dependent speaker verification," arXiv preprint arXiv:1710.10470 (2017).
  13. L. Wan, Q. Wang, A. Papir, and I. L. Moreno, "Generalized end-to-end loss for speaker verification," arXiv preprint arXiv:1710.10467 (2017).
  14. Q. Wang, C. Downey, L. Wan, P. A. Mansfield, and I. L. Moreno, "Speaker diarization with LSTM," Proc. ICASSP. 5239-5243 (2018).