http://dx.doi.org/10.13064/KSSS.2015.7.1.003

Evaluation of Frequency Warping Based Features and Spectro-Temporal Features for Speaker Recognition  

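Choi, Young Ho (Pusan National University)
Ban, Sung Min (Pusan National University)
Kim, Kyung-Wha (Supreme Prosecutors' Office, Voice Analysis Laboratory)
Kim, Hyung Soon (Pusan National University)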
Publication Information
Phonetics and Speech Sciences, v.7, no.1, 2015, pp. 3-10
Abstract
In this paper, different frequency scales in cepstral feature extraction are evaluated for text-independent speaker recognition. To this end, mel-frequency cepstral coefficients (MFCCs), linear frequency cepstral coefficients (LFCCs), and bilinear warped frequency cepstral coefficients (BWFCCs) are compared in speaker recognition experiments. In addition, spectro-temporal features extracted from the cepstral-time matrix (CTM) are examined as an alternative to the delta and delta-delta features. Experiments on the NIST speaker recognition evaluation (SRE) 2004 task are carried out using the Gaussian mixture model-universal background model (GMM-UBM) method and the joint factor analysis (JFA) method, both based on the ALIZE 3.0 toolkit. Results with both methods show that BWFCC with an appropriate warping factor yields better performance than MFCC and LFCC. The feature set that includes CTM-based spectro-temporal information also outperforms the conventional feature set that includes the delta and delta-delta features.
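The two ideas in the abstract, frequency warping and the cepstral-time matrix, can be illustrated with a minimal sketch. This is not the authors' implementation. The mel formula is the common 2595·log10(1 + f/700) variant, the bilinear warp is the standard first-order all-pass phase mapping, and the CTM is taken here as a DCT-II along the time axis of a block of cepstral vectors. The function names, the normalization, and any warping factor passed in are assumptions made for illustration; the record does not state the paper's exact formulas or settings.

```python
import math


def hz_to_mel(f_hz):
    """Common mel-scale mapping (assumed variant, not taken from the paper)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def bilinear_warp(omega, alpha):
    """Warp a normalized angular frequency omega in [0, pi].

    Uses the phase response of a first-order all-pass filter with
    warping factor alpha. alpha = 0 leaves the axis linear; alpha > 0
    stretches the low frequencies, similar in spirit to the mel scale.
    """
    return omega + 2.0 * math.atan2(
        alpha * math.sin(omega), 1.0 - alpha * math.cos(omega)
    )


def cepstral_time_matrix(frames, num_cols):
    """Unnormalized DCT-II along time for a block of cepstral vectors.

    frames: list of M cepstral vectors, each of length N.
    Returns an N x num_cols matrix. Column 0 is proportional to the
    block average; higher columns capture slower-to-faster temporal
    variation of each cepstral coefficient.
    """
    m_frames = len(frames)
    n_coeffs = len(frames[0])
    ctm = [[0.0] * num_cols for _ in range(n_coeffs)]
    for n in range(n_coeffs):
        for m in range(num_cols):
            ctm[n][m] = sum(
                frames[t][n]
                * math.cos((2 * t + 1) * m * math.pi / (2 * m_frames))
                for t in range(m_frames)
            )
    return ctm
```

In this sketch, sweeping `alpha` moves the frequency axis continuously from linear (LFCC-like) toward mel-like warping, which is why a single tunable factor can be compared against both fixed scales. The low-order CTM columns play the role that delta and delta-delta coefficients play in a conventional feature set.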
Keywords
speaker recognition; GMM-UBM; JFA; MFCC; LFCC; BWFCC; delta feature; cepstral-time matrix