Speaker Verification with the Constraint of Limited Data

  • Received: 2016.01.27
  • Accepted: 2017.02.09
  • Published: 2018.08.31

Abstract

Speaker verification performance depends on the utterances available from each speaker: the speaker-specific information must be captured from the utterance. Under the constraint of limited data, where the training and testing utterances last only a few seconds, speaker verification becomes a challenging task. The feature vectors extracted by single frame size and rate (SFSR) analysis are not sufficient for training and testing speakers in speaker verification; this leads to poor speaker modeling during training and may not provide good decisions during testing. The problem can be addressed by increasing the number of feature vectors extracted from training and testing data of the same duration. To this end, we use multiple frame size (MFS), multiple frame rate (MFR), and multiple frame size and rate (MFSR) analysis techniques for speaker verification under the limited data condition. These analysis techniques extract relatively more feature vectors during training and testing, and thereby yield improved modeling and testing for limited data. To demonstrate this, we use mel-frequency cepstral coefficients (MFCC) and linear prediction cepstral coefficients (LPCC) as features, and Gaussian mixture models (GMM) and the GMM-universal background model (GMM-UBM) for speaker modeling, with the NIST-2003 database. The experimental results indicate that MFS, MFR, and MFSR analysis perform markedly better than SFSR analysis, and that LPCC-based MFSR analysis outperforms the other analysis and feature extraction techniques.
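The core idea of MFSR analysis described above is that framing the same short utterance with several window sizes and several frame shifts multiplies the number of feature vectors obtained from limited data. The following is a minimal sketch of that pooling step, assuming a toy 1-second signal and using per-frame log energy as a stand-in for the MFCC/LPCC vectors used in the paper; the function and variable names are illustrative, not from the original work.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Slice signal x into overlapping frames of frame_len samples, hop samples apart."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def log_energy(frames):
    """Toy per-frame feature; the paper uses MFCC/LPCC vectors here instead."""
    return np.log(np.sum(frames.astype(float) ** 2, axis=1) + 1e-10)

def pooled_features(x, frame_lens, hops):
    """Pool the feature vectors from every (frame size, frame rate) pair."""
    feats = []
    for fl in frame_lens:
        for hp in hops:
            feats.append(log_energy(frame_signal(x, fl, hp)))
    return np.concatenate(feats)

fs = 8000
x = np.random.randn(fs)                              # 1 s of toy "speech"
sfsr = pooled_features(x, [160], [80])               # SFSR: one 20 ms window, 10 ms shift
mfsr = pooled_features(x, [160, 240, 320], [40, 80]) # MFSR: three sizes, two rates
print(len(sfsr), len(mfsr))  # MFSR yields many more vectors from the same data
```

With the assumed 8 kHz sampling rate, the single 20 ms / 10 ms configuration gives 99 frames, while the six size/rate combinations together give 879, illustrating why MFSR provides more material for GMM training from the same few seconds of speech.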

Keywords

References

  1. A. K. Jain, A. Ross, and S. Prabhakar, "An introduction to biometric recognition," IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 1, pp. 4-20, 2004. https://doi.org/10.1109/TCSVT.2004.839484
  2. S. Dey, S. Barman, R. K. Bhukya, R. K. Das, B. C. Haris, S. R. M. Prasanna, and R. Sinha, "Speech biometric based attendance system," in Proceedings of 2014 Twentieth National Conference on Communications (NCC), Kanpur, India, 2014, pp. 1-6.
  3. T. Kinnunen and H. Li, "An overview of text-independent speaker recognition: from features to supervectors," Speech Communication, vol. 52, no. 1, pp. 12-40, 2010. https://doi.org/10.1016/j.specom.2009.08.009
  4. G. Pradhan and S. M. Prasanna, "Speaker verification under degraded condition: a perceptual study," International Journal of Speech Technology, vol. 14, no. 4, pp. 405-417, 2011. https://doi.org/10.1007/s10772-011-9120-6
  5. A. E. Rosenberg, "Automatic speaker verification: a review," Proceedings of the IEEE, vol. 64, no. 4, pp. 475-487, 1976. https://doi.org/10.1109/PROC.1976.10156
  6. A. Neustein and H. A. Patil, Forensic Speaker Recognition. Heidelberg: Springer, 2012.
  7. H. S. Jayanna and S. M. Prasanna, "Analysis, feature extraction, modeling and testing techniques for speaker recognition," IETE Technical Review, vol. 26, no. 3, pp. 181-190, 2009. https://doi.org/10.4103/0256-4602.50702
  8. H. S. Jayanna, "Limited data speaker recognition," Ph.D. dissertation, Indian Institute of Technology Guwahati, India, 2009.
  9. D. Pati and S. M. Prasanna, "Subsegmental, segmental and suprasegmental processing of linear prediction residual for speaker information," International Journal of Speech Technology, vol. 14, no. 1, pp. 49-64, 2011. https://doi.org/10.1007/s10772-010-9087-8
  10. L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice Hall, 1993.
  11. S. M. Prasanna, C. G. Gupta, and B. Yegnanarayana, "Extraction of speaker-specific excitation information from linear prediction residual of speech," Speech Communication, vol. 48, no. 10, pp. 1243-1261, 2006. https://doi.org/10.1016/j.specom.2006.06.002
  12. B. Yegnanarayana, S. M. Prasanna, J. M. Zachariah, and C. S. Gupta, "Combining evidence from source, suprasegmental and spectral features for a fixed-text speaker verification system," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 4, pp. 575-582, 2005. https://doi.org/10.1109/TSA.2005.848892
  13. F. Farahani, P. G. Georgiou, and S. S. Narayanan, "Speaker identification using supra-segmental pitch pattern dynamics," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Canada, 2004, pp. 89-92.
  14. A. V. Jadhav and R. V. Pawar, "Review of various approaches towards speech recognition," in Proceedings of 2012 International Conference on Biomedical Engineering (ICoBE), Penang, Malaysia, 2012, pp. 99-103.
  15. H. S. Jayanna and S. M. Prasanna, "Multiple frame size and rate analysis for speaker recognition under limited data condition," IET Signal Processing, vol. 3, no. 3, pp. 189-204, 2009. https://doi.org/10.1049/iet-spr.2008.0211
  16. G. L. Sarada, T. Nagarajan, and H. A. Murthy, "Multiple frame size and multiple frame rate feature extraction for speech recognition," in Proceedings of 2004 International Conference on Signal Processing and Communications, Bangalore, India, 2004, pp. 592-595.
  17. K. Samudravijaya, "Variable frame size analysis for speech recognition," in Proceedings of the International Conference on Natural Language Processing, Hyderabad, India, 2004.
  18. Q. Zhu and A. Alwan, "On the use of variable frame rate analysis in speech recognition," in Proceedings of 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing, Istanbul, Turkey, 2000, pp. 1783-1786.
  19. P. Le Cerf and D. Van Compernolle, "A new variable frame analysis method for speech recognition," IEEE Signal Processing Letters, vol. 1, no. 12, pp. 185-187, 1994. https://doi.org/10.1109/97.338746
  20. R. Pawar and H. Kulkarni, "Analysis of FFSR, VFSR, MFSR techniques for feature extraction in speaker recognition: a review," International Journal of Computer Science, vol. 7, no. 4, pp. 26-31, 2010.
  21. T. Nagarajan, "Implicit systems for spoken language identification," Ph.D. dissertation, Indian Institute of Technology Madras, India, 2004.
  22. G. S. Ghadiyaram, N. H. Nagarajan, T. N. Thangavelu, and H. A. Murthy, "Automatic transcription of continuous speech using unsupervised and incremental training," in Proceedings of the 8th International Conference on Spoken Language Processing, Jeju Island, Korea, 2004.
  23. National Institute of Standards and Technology, "The NIST Year 2003 speaker recognition evaluation plan," 2013 [Online]. Available: https://www.nist.gov/sites/default/files/documents/2017/09/26/2003-spkrec-evalplanv2.2.pdf
  24. S. Nakagawa, L. Wang, and S. Ohtsuka, "Speaker identification and verification by combining MFCC and phase information," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 4, pp. 1085-1095, 2012. https://doi.org/10.1109/TASL.2011.2172422
  25. A. Salman, E. Muhammad, and K. Khurshid, "Speaker verification using boosted cepstral features with Gaussian distributions," in Proceedings of IEEE International Multitopic Conference, Lahore, Pakistan, 2007, pp. 1-5.
  26. D. Pati and S. M. Prasanna, "Processing of linear prediction residual in spectral and cepstral domains for speaker information," International Journal of Speech Technology, vol. 18, no. 3, pp. 333-350, 2015. https://doi.org/10.1007/s10772-015-9273-9
  27. W. C. Hsu, W. H. Lai, and W. P. Hong, "Usefulness of residual-based features in speaker verification and their combination way with linear prediction coefficients," in Proceedings of the 9th IEEE International Symposium on Multimedia Workshops, Beijing, China, 2007, pp. 246-251.
  28. S. Furui, "Comparison of speaker recognition methods using statistical features and dynamic features," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 29, no. 3, pp. 342-350, 1981. https://doi.org/10.1109/TASSP.1981.1163605
  29. V. Prakash and J. H. L. Hansen, "In-set/out-of-set speaker recognition under sparse enrollment," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 7, pp. 2044-2052, 2007. https://doi.org/10.1109/TASL.2007.902058
  30. T. Hasan and J. H. Hansen, "A study on universal background model training in speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 1890-1899, 2011. https://doi.org/10.1109/TASL.2010.2102753
  31. N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2011. https://doi.org/10.1109/TASL.2010.2064307
  32. E. Wong and S. Sridharan, "Comparison of linear prediction cepstrum coefficients and mel-frequency cepstrum coefficients for language identification," in Proceedings of 2001 International Symposium on Intelligent Multimedia, Video and Speech Processing, Hong Kong, China, 2001, pp. 95-98.