http://dx.doi.org/10.4218/etrij.11.0110.0684

Text-Independent Speaker Verification Using Variational Gaussian Mixture Model  

Moattar, Mohammad Hossein (Department of Computer Engineering and Information Technology, Amirkabir University of Technology)
Homayounpour, Mohammad Mehdi (Department of Computer Engineering and Information Technology, Amirkabir University of Technology)
Publication Information
ETRI Journal / v.33, no.6, 2011, pp. 914-923
Abstract
This paper concerns robust and reliable speaker model training for text-independent speaker verification. The baseline speaker modeling approach is the Gaussian mixture model (GMM). In text-independent speaker verification, the amount of available speech data may differ from speaker to speaker, yet the modeling approach should perform equally well for all speakers. The modeling technique should also be as robust as possible to unseen data. A traditional approach to GMM training is the expectation maximization (EM) method, which is known for its overfitting problem and its weakness in handling insufficient training data. To tackle these problems, a variational approximation is proposed. Variational approaches are known to be robust against overtraining and data insufficiency. The proposed approach is evaluated on two databases, KING and TFarsdat. The experiments show that it improves performance on the TFarsdat and KING databases by 0.56% and 4.81%, respectively. The experiments also show that the variationally optimized GMM is more robust against noise: the verification error rate in noisy environments on the TFarsdat dataset decreases by 1.52%.
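The EM baseline mentioned in the abstract can be sketched for a one-dimensional mixture as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the synthetic two-cluster data, the initial means, and the variance floor are all introduced here for the example, and real speaker models are trained on multivariate acoustic features.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    """Total log-likelihood of the data under the mixture."""
    return sum(
        math.log(sum(w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data
    )


def em_step(data, weights, means, variances, var_floor=1e-3):
    """One EM iteration: E-step responsibilities, then M-step re-estimation."""
    k = len(weights)
    # E-step: posterior probability of each component for each sample.
    resp = []
    for x in data:
        joint = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: maximum-likelihood updates weighted by the responsibilities.
    new_w, new_m, new_v = [], [], []
    for j in range(k):
        n_j = sum(r[j] for r in resp)
        mean_j = sum(r[j] * x for r, x in zip(resp, data)) / n_j
        var_j = sum(r[j] * (x - mean_j) ** 2 for r, x in zip(resp, data)) / n_j
        new_w.append(n_j / len(data))
        new_m.append(mean_j)
        # Assumed variance floor: an ad hoc guard against a component
        # collapsing onto a single point (one symptom of EM overfitting).
        new_v.append(max(var_j, var_floor))
    return new_w, new_m, new_v


def fit_gmm(data, means, n_iter=30):
    """Run EM from uniform weights and unit variances; track the likelihood."""
    k = len(means)
    weights = [1.0 / k] * k
    variances = [1.0] * k
    history = [log_likelihood(data, weights, means, variances)]
    for _ in range(n_iter):
        weights, means, variances = em_step(data, weights, means, variances)
        history.append(log_likelihood(data, weights, means, variances))
    return weights, means, variances, history


# Synthetic data (an assumption for this sketch): two well-separated clusters.
random.seed(0)
data = ([random.gauss(-2.0, 0.5) for _ in range(150)]
        + [random.gauss(3.0, 1.0) for _ in range(150)])
weights, means, variances, history = fit_gmm(data, means=[-1.0, 1.0])
print("weights:", weights)
print("means:", means)
print("variances:", variances)
```

The M-step here produces point estimates with no prior over the parameters, which is why a guard such as the variance floor is commonly added. Variational Bayesian training, in its usual form, replaces these point estimates with posterior distributions over the mixture parameters.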
Keywords
Gaussian mixture model; expectation maximization; variational approximation; speaker verification