Short utterance speaker verification using PLDA model adaptation and data augmentation

  • Received : 2017.01.26
  • Accepted : 2017.06.08
  • Published : 2017.06.30

Abstract

Conventional speaker verification systems based on a time delay neural network, identity vectors, and probabilistic linear discriminant analysis (TDNN-Ivector-PLDA) are known to be very effective for verifying long-duration speech utterances. However, when test utterances are short, the duration mismatch between enrollment and test utterances significantly degrades the performance of TDNN-Ivector-PLDA systems. To compensate for the I-vector mismatch between long and short utterances, this paper proposes PLDA model adaptation with augmented data. A PLDA model is first trained on a large amount of speech data, most of which is of long duration. The PLDA model is then adapted with I-vectors obtained from short-utterance data that has been augmented using vocal tract length perturbation (VTLP). In experiments on the NIST SRE 2008 database, the proposed method achieves significantly better performance than conventional TDNN-Ivector-PLDA systems when there is a duration mismatch between enrollment and test utterances.
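As an illustration of the VTLP step named in the abstract, the following is a minimal sketch of a piecewise-linear frequency warp of the kind commonly used for VTLP. The abstract does not give the warp function or its settings, so the function name `vtlp_warp`, the 8 kHz sampling rate, the 3400 Hz boundary frequency, and the warp-factor range of 0.9 to 1.1 are all assumptions made for this example rather than the paper's configuration.

```python
import numpy as np


def vtlp_warp(freqs, alpha, sample_rate=8000.0, f_hi=3400.0):
    """Warp frequencies (Hz) with a piecewise-linear VTLP-style map.

    Assumed, illustrative parameters (not taken from the paper):
      alpha       -- warp factor, typically drawn near 1.0
      sample_rate -- sampling rate in Hz (telephone speech assumed)
      f_hi        -- boundary frequency below the Nyquist frequency
    """
    freqs = np.asarray(freqs, dtype=float)
    nyquist = sample_rate / 2.0
    # Below this boundary the warp is a plain scaling by alpha.
    boundary = f_hi * min(alpha, 1.0) / alpha
    # Above it, a second linear segment maps the Nyquist frequency to itself,
    # so the warped axis still covers exactly [0, nyquist].
    slope = (nyquist - f_hi * min(alpha, 1.0)) / (nyquist - boundary)
    upper = nyquist - slope * (nyquist - freqs)
    return np.where(freqs <= boundary, freqs * alpha, upper)


# Draw one random warp factor per augmented copy of an utterance.
rng = np.random.default_rng(0)
alpha = rng.uniform(0.9, 1.1)

# Warp hypothetical filterbank center frequencies before feature extraction.
centers = np.linspace(100.0, 3900.0, 20)
warped_centers = vtlp_warp(centers, alpha)
```

In a pipeline like the one the abstract describes, each short utterance would be re-featurized with a few different warp factors, and the resulting I-vectors would be pooled as adaptation data for the PLDA model.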
