Text-Independent Speaker Verification Using Variational Gaussian Mixture Model

  • Received : 2010.11.15
  • Accepted : 2011.05.16
  • Published : 2011.12.31

Abstract

This paper addresses robust and reliable speaker model training for text-independent speaker verification. The baseline speaker modeling approach is the Gaussian mixture model (GMM). In text-independent speaker verification, the amount of available speech data may differ from speaker to speaker, yet the modeling approach should perform equally well for all speakers. The modeling technique should also be as robust as possible to unseen data. The traditional approach to GMM training is the expectation-maximization (EM) method, which is known to overfit and to cope poorly with insufficient training data. To tackle these problems, we propose a variational approximation, since variational approaches are known to be robust against overtraining and data insufficiency. We evaluated the proposed approach on two databases, KING and TFarsdat. The experiments show that the proposed approach improves performance on the TFarsdat and KING databases by 0.56% and 4.81%, respectively. The experiments also show that the variationally optimized GMM is more robust against noise: the verification error rate in noisy environments on the TFarsdat dataset decreases by 1.52%.
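The contrast between EM-trained and variationally trained GMMs can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn's `GaussianMixture` and `BayesianGaussianMixture` as stand-ins for the two training schemes, and it uses small synthetic 2-D data in place of speech features from KING or TFarsdat. The component count, prior value, and weight threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture, GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for one speaker's feature frames (assumption, not real
# speech data): two clusters in 2-D, with deliberately few training samples.
train = np.vstack([
    rng.normal(loc=-3.0, scale=1.0, size=(30, 2)),
    rng.normal(loc=3.0, scale=1.0, size=(30, 2)),
])
held_out = np.vstack([
    rng.normal(loc=-3.0, scale=1.0, size=(200, 2)),
    rng.normal(loc=3.0, scale=1.0, size=(200, 2)),
])

n_components = 8  # intentionally more components than the data supports

# Maximum-likelihood GMM trained with EM.
em_gmm = GaussianMixture(
    n_components=n_components,
    covariance_type="diag",
    random_state=0,
).fit(train)

# Variational Bayesian GMM: a small Dirichlet concentration prior lets the
# optimizer shrink the weights of components the data does not need.
vb_gmm = BayesianGaussianMixture(
    n_components=n_components,
    covariance_type="diag",
    weight_concentration_prior_type="dirichlet_distribution",
    weight_concentration_prior=1e-2,
    max_iter=500,
    random_state=0,
).fit(train)

# score() returns the average per-sample log-likelihood on unseen data.
em_score = em_gmm.score(held_out)
vb_score = vb_gmm.score(held_out)

# Count components that keep a non-negligible weight (threshold is arbitrary).
vb_active = int(np.sum(vb_gmm.weights_ > 1e-2))

print(f"EM held-out log-likelihood:          {em_score:.3f}")
print(f"Variational held-out log-likelihood: {vb_score:.3f}")
print(f"Variational components in use:       {vb_active} of {n_components}")
```

Comparing the two held-out log-likelihoods, and the number of components the variational model keeps in use, gives a rough feel for the overfitting behavior the abstract refers to. The figures depend on the synthetic data and random seed and say nothing about the error rates reported above.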
