http://dx.doi.org/10.13064/KSSS.2017.9.2.085

Short utterance speaker verification using PLDA model adaptation and data augmentation  

Yoon, Sung-Wook (Chungbuk National University, Control and Robot Engineering major)
Kwon, Oh-Wook (Chungbuk National University)
Publication Information
Phonetics and Speech Sciences, v.9, no.2, 2017, pp. 85-94
Abstract
Conventional speaker verification systems based on a time delay neural network, identity vectors, and probabilistic linear discriminant analysis (TDNN-Ivector-PLDA) are known to be very effective for verifying long-duration speech utterances. However, when test utterances are short, the duration mismatch between enrollment and test utterances significantly degrades the performance of TDNN-Ivector-PLDA systems. To compensate for the I-vector mismatch between long and short utterances, this paper proposes PLDA model adaptation with augmented data. A PLDA model is first trained on a vast amount of speech data, most of which is of long duration. The PLDA model is then adapted with I-vectors obtained from short-utterance data that are augmented by vocal tract length perturbation (VTLP). In computer experiments on the NIST SRE 2008 database, the proposed method achieves significantly better performance than conventional TDNN-Ivector-PLDA systems when there is a duration mismatch between enrollment and test utterances.
Keywords
time delay neural network (TDNN); identity vector (I-vector); probabilistic linear discriminant analysis (PLDA); vocal tract length perturbation (VTLP)
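The abstract names VTLP as the augmentation step but does not give its settings. As a rough illustration only, the following Python sketch implements the piecewise-linear frequency warp commonly used for VTLP (as described by Jaitly and Hinton, 2013). The function names `vtlp_warp` and `sample_warp_factor`, the 8 kHz sampling rate, the 3400 Hz boundary frequency, and the warp-factor range of 0.9 to 1.1 are all assumptions made for this example, not values taken from the paper.

```python
import random


def vtlp_warp(freq_hz, alpha, sample_rate=8000.0, f_hi=3400.0):
    """Map one frequency (Hz) through a piecewise-linear VTLP warp.

    alpha is the warp factor; sample_rate and f_hi are assumed values
    chosen for telephone-bandwidth speech, not settings from the paper.
    """
    nyquist = sample_rate / 2.0
    knee = f_hi * min(alpha, 1.0)   # warped frequency where the two segments meet
    boundary = knee / alpha         # input frequency that maps onto the knee
    if freq_hz <= boundary:
        # Lower segment: plain scaling by the warp factor.
        return freq_hz * alpha
    # Upper segment: straight line from the knee to the Nyquist frequency,
    # so the warped axis still ends exactly at Nyquist.
    slope = (nyquist - knee) / (nyquist - boundary)
    return nyquist - slope * (nyquist - freq_hz)


def sample_warp_factor(rng, low=0.9, high=1.1):
    """Draw a random warp factor for one augmented copy (assumed range)."""
    return rng.uniform(low, high)


if __name__ == "__main__":
    rng = random.Random(0)
    alpha = sample_warp_factor(rng)
    # Warp a few filterbank center frequencies with the sampled factor.
    for f in (500.0, 1500.0, 3000.0, 3800.0):
        print(f, "->", round(vtlp_warp(f, alpha), 1))
```

In a typical pipeline, the warp would be applied to the center frequencies of the mel filterbank before feature extraction, producing several perturbed copies of each short utterance from which additional I-vectors are extracted.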