Short utterance speaker verification using PLDA model adaptation and data augmentation

Yoon, Sung-Wook; Kwon, Oh-Wook

doi:10.13064/KSSS.2017.9.2.085

Phonetics and Speech Sciences (말소리와 음성과학)

Volume 9, Issue 2, pp. 85-94, 2017
pISSN 2005-8063, eISSN 2586-5854

Korean Society of Speech Sciences (한국음성학회)

Sung-Wook Yoon (Control and Robot Engineering, Chungbuk National University);
Oh-Wook Kwon (Chungbuk National University)

Received : 2017.01.26
Accepted : 2017.06.08
Published : 2017.06.30

Abstract

Conventional speaker verification systems that combine a time delay neural network, identity vectors, and probabilistic linear discriminant analysis (TDNN-Ivector-PLDA) are known to be very effective for verifying long-duration speech utterances. However, when test utterances are short, the duration mismatch between enrollment and test utterances significantly degrades the performance of TDNN-Ivector-PLDA systems. To compensate for the I-vector mismatch between long and short utterances, this paper proposes PLDA model adaptation with augmented data. A PLDA model is first trained on a vast amount of speech data, most of which is of long duration. The PLDA model is then adapted with I-vectors obtained from short-utterance data augmented by vocal tract length perturbation (VTLP). In experiments on the NIST SRE 2008 database, the proposed method achieves significantly better performance than the conventional TDNN-Ivector-PLDA system when there is a duration mismatch between enrollment and test utterances.
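
The abstract does not spell out how VTLP is applied, so the following is only a minimal sketch of the piecewise-linear frequency warp commonly used for VTLP. The function names, the 8 kHz sampling rate, the 3400 Hz boundary frequency `f_hi`, and the warp-factor range of 0.9 to 1.1 are all assumptions chosen for illustration, not values taken from the paper.

```python
import random


def vtlp_warp(freq_hz, alpha, sample_rate=8000.0, f_hi=3400.0):
    """Warp one frequency (Hz) with a piecewise-linear VTLP function.

    alpha is the warp factor; sample_rate and f_hi are illustrative
    assumptions (telephone-band speech), not settings from the paper.
    """
    nyquist = sample_rate / 2.0
    top = f_hi * min(alpha, 1.0)   # warped value at the breakpoint
    boundary = top / alpha         # breakpoint on the input axis
    if freq_hz <= boundary:
        # Low band: plain linear scaling by alpha.
        return freq_hz * alpha
    # High band: straight line from (boundary, top) to (nyquist, nyquist),
    # so the warped axis still ends exactly at the Nyquist frequency.
    slope = (nyquist - top) / (nyquist - boundary)
    return nyquist - slope * (nyquist - freq_hz)


def sample_warp_factor(rng, low=0.9, high=1.1):
    """Draw one random warp factor per utterance (the range is an assumption)."""
    return rng.uniform(low, high)


if __name__ == "__main__":
    rng = random.Random(0)
    alpha = sample_warp_factor(rng)
    # Warp a few hypothetical filterbank centre frequencies.
    for f in (500.0, 1500.0, 3000.0, 3900.0):
        print(f, round(vtlp_warp(f, alpha), 1))
```

In a pipeline of this kind, the warped frequencies would typically be used to place the filterbank centres when extracting features from each short utterance, which yields an additional, slightly perturbed copy of the utterance for I-vector extraction and PLDA adaptation.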
