Deep neural networks for speaker verification with short speech utterances

  • Il-Ho Yang (School of Computer Science, University of Seoul);
  • Hee-Soo Heo (School of Computer Science, University of Seoul);
  • Sung-Hyun Yoon (School of Computer Science, University of Seoul);
  • Ha-Jin Yu (School of Computer Science, University of Seoul)
  • Received : 2016.09.01
  • Accepted : 2016.11.25
  • Published : 2016.11.30

Abstract

We propose a method to improve the robustness of speaker verification on short test utterances. The accuracy of state-of-the-art i-vector/probabilistic linear discriminant analysis systems degrades when test utterances are short. The proposed method uses deep neural networks to compensate for the duration variability of feature vectors extracted from short test utterances. We design three different types of DNN (Deep Neural Network) structures, each trained with a different target output vector. Each DNN is trained to minimize the discrepancy between its feed-forwarded output for a short-utterance feature and the feature extracted from the corresponding original long utterance. We use the short 2-10 s condition of the NIST (National Institute of Standards and Technology, U.S.) 2008 SRE (Speaker Recognition Evaluation) corpus to evaluate the method. The experimental results show that the proposed method reduces the minimum detection cost relative to the baseline system.
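The exact DNN structures and target output vectors are not specified here, but the training objective described above — pushing the network's output for a short-utterance feature toward the feature of the matching long utterance — can be sketched directly. Below is a minimal, hypothetical PyTorch sketch of that idea; the feature dimensionality, network depth, activations, and optimizer settings are all assumptions for illustration, not the paper's configuration.

```python
# Minimal sketch (not the authors' exact architecture): a feed-forward
# network trained to map a feature vector from a short utterance to the
# feature vector of the matching original long utterance. All dimensions
# and hyperparameters below are assumptions.
import torch
import torch.nn as nn

FEAT_DIM = 600  # assumed feature (e.g., i-vector) dimensionality

class CompensationDNN(nn.Module):
    def __init__(self, dim: int = FEAT_DIM, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),  # output lives in the input feature space
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = CompensationDNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()  # discrepancy from the long-utterance target

def train_step(short_feat: torch.Tensor, long_feat: torch.Tensor) -> float:
    """One update: move the output for a short-utterance feature
    toward the feature extracted from the matching long utterance."""
    optimizer.zero_grad()
    loss = loss_fn(model(short_feat), long_feat)
    loss.backward()
    optimizer.step()
    return loss.item()

# Example with random stand-in batches (32 paired short/long features):
short_batch = torch.randn(32, FEAT_DIM)
long_batch = torch.randn(32, FEAT_DIM)
print(train_step(short_batch, long_batch))
```

At test time, a short-utterance feature would be passed through the trained network and the compensated output used for scoring in place of the raw feature.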

In this paper, we propose a method to improve speaker verification performance on short test utterances. When the test utterance is short, the performance of an i-vector/probabilistic linear discriminant analysis-based speaker verification system degrades. The proposed method compensates for duration-dependent variability by transforming the feature vectors extracted from short utterances with deep neural networks. We propose three ways of using deep neural networks, distinguished by the output labels used during training. Each network is trained to reduce the difference between its output for a short-utterance input feature and the feature extracted from the original long utterance. We evaluate the proposed method under the short 2-10 s condition of the NIST (National Institute of Standards and Technology, U.S.) 2008 SRE (Speaker Recognition Evaluation) corpus. The experimental results show that the minimum detection cost is reduced compared to a conventional method based on within-class covariance normalization and linear discriminant analysis. We also compare performance against a short utterance variance normalization-based method.
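The "minimum detection cost" used as the evaluation measure follows the NIST SRE cost model. For reference, here is a small sketch of how it can be computed from target and non-target trial scores; the cost parameters (C_miss = 10, C_fa = 1, P_target = 0.01) match the NIST SRE 2008 evaluation plan, while the exhaustive threshold sweep is just one straightforward way to find the minimum.

```python
# Sketch of the minimum detection cost (minDCF) from trial scores,
# using the NIST SRE 2008 cost parameters. The sweep over all observed
# scores as candidate thresholds is illustrative, not optimized.
import numpy as np

def min_dcf(target_scores: np.ndarray, nontarget_scores: np.ndarray,
            c_miss: float = 10.0, c_fa: float = 1.0,
            p_target: float = 0.01) -> float:
    """Minimum of the detection cost function over all thresholds."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    best = np.inf
    for t in thresholds:
        p_miss = np.mean(target_scores < t)    # targets rejected
        p_fa = np.mean(nontarget_scores >= t)  # non-targets accepted
        cost = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
        best = min(best, float(cost))
    return best

# Example with synthetic scores:
rng = np.random.default_rng(0)
print(min_dcf(rng.normal(2.0, 1.0, 1000), rng.normal(0.0, 1.0, 10000)))
```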
