Evaluation of Frequency Warping Based Features and Spectro-Temporal Features for Speaker Recognition

  • Received : 2015.01.12
  • Accepted : 2015.03.16
  • Published : 2015.03.31

Abstract

In this paper, different frequency scales for cepstral feature extraction are evaluated for text-independent speaker recognition. To this end, mel-frequency cepstral coefficients (MFCCs), linear frequency cepstral coefficients (LFCCs), and bilinear warped frequency cepstral coefficients (BWFCCs) are compared in speaker recognition experiments. In addition, spectro-temporal features extracted from the cepstral-time matrix (CTM) are examined as an alternative to the delta and delta-delta features. Experiments on the NIST speaker recognition evaluation (SRE) 2004 task are carried out using the Gaussian mixture model-universal background model (GMM-UBM) method and the joint factor analysis (JFA) method, both based on the ALIZE 3.0 toolkit. Results with both methods show that BWFCC with an appropriate warping factor yields better performance than MFCC and LFCC. The results also show that a feature set including CTM-based spectro-temporal information outperforms the conventional feature set including the delta and delta-delta features.
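
The two ideas named in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact formulation, so the code below assumes the commonly used first-order all-pass (bilinear) frequency map and a DCT taken along the time axis of a block of cepstral vectors. The function names, the warping factor `alpha`, the block length, and the number of retained temporal coefficients are hypothetical choices for illustration, not values from the paper.

```python
import numpy as np


def bilinear_warp(omega, alpha):
    """Warp normalized angular frequencies (radians, 0..pi).

    Assumed form: the first-order all-pass (bilinear) map with |alpha| < 1.
    alpha = 0 leaves the axis linear; alpha > 0 stretches low frequencies.
    """
    omega = np.asarray(omega, dtype=float)
    return omega + 2.0 * np.arctan(
        alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega))
    )


def cepstral_time_matrix(cepstra, block_len, n_time_coeffs):
    """Build CTM-style features from a (n_frames, n_ceps) cepstral array.

    For each sliding block of `block_len` frames, an unnormalized DCT-II is
    applied along time and the first `n_time_coeffs` coefficients are kept
    for every cepstral index (hypothetical layout: one flat row per block).
    """
    n_frames, n_ceps = cepstra.shape
    k = np.arange(block_len)
    n = np.arange(n_time_coeffs)
    # DCT-II basis over the time axis: shape (n_time_coeffs, block_len)
    basis = np.cos(np.pi * np.outer(n, 2 * k + 1) / (2 * block_len))
    feats = []
    for t in range(n_frames - block_len + 1):
        block = cepstra[t:t + block_len]   # (block_len, n_ceps)
        ctm = basis @ block                # (n_time_coeffs, n_ceps)
        feats.append(ctm.T.ravel())        # grouped by cepstral index
    if not feats:
        # Fewer frames than one block: return an empty feature array
        return np.empty((0, n_ceps * n_time_coeffs))
    return np.array(feats)
```

In a sketch like this, the warped frequencies would decide where filterbank centers are placed before the cepstrum is computed, and the CTM rows would take the place of the static-plus-delta feature vector. The zeroth temporal coefficient is a scaled block average of each cepstral coefficient, while the higher ones capture how it changes over the block.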
