Evaluation of Frequency Warping Based Features and Spectro-Temporal Features for Speaker Recognition

Choi, Young Ho; Ban, Sung Min; Kim, Kyung-Wha; Kim, Hyung Soon

doi:10.13064/KSSS.2015.7.1.003

Phonetics and Speech Sciences (말소리와 음성과학)

Volume 7, Issue 1 / Pages 3-10 / 2015 / 2005-8063 (pISSN) / 2586-5854 (eISSN)

Korean Society of Speech Sciences (한국음성학회)



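Choi, Young Ho (Pusan National University);
Ban, Sung Min (Pusan National University);
Kim, Kyung-Wha (Voice Analysis Lab, Supreme Prosecutors' Office);
Kim, Hyung Soon (Pusan National University)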

Received : 2015.01.12
Accepted : 2015.03.16
Published : 2015.03.31

https://doi.org/10.13064/KSSS.2015.7.1.003


Abstract

This paper evaluates different frequency scales in cepstral feature extraction for text-independent speaker recognition. To this end, mel-frequency cepstral coefficients (MFCCs), linear frequency cepstral coefficients (LFCCs), and bilinear warped frequency cepstral coefficients (BWFCCs) are compared in speaker recognition experiments. In addition, spectro-temporal features extracted with the cepstral-time matrix (CTM) are examined as an alternative to the delta and delta-delta features. Experiments on the NIST Speaker Recognition Evaluation (SRE) 2004 task are carried out using the Gaussian mixture model-universal background model (GMM-UBM) method and the joint factor analysis (JFA) method, both implemented with the ALIZE 3.0 toolkit. With both methods, BWFCC with an appropriate warping factor yields better performance than MFCC and LFCC. The feature set that includes CTM-based spectro-temporal information also outperforms the conventional feature set that includes the delta and delta-delta features.
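The two feature ideas named in the abstract can be illustrated with a minimal sketch. It rests on assumptions, since the abstract itself gives no formulas: the bilinear warp is taken to be the standard first-order all-pass phase mapping, and the CTM is taken to be a DCT-II along the time axis of a block of consecutive cepstral vectors. The function names, the value `alpha = 0.4`, and the filter count are hypothetical choices for illustration, not the settings used in the paper.

```python
import math

def bilinear_warp(omega, alpha):
    """Warp a normalized angular frequency omega (0..pi) with the
    first-order all-pass (bilinear) transform. alpha = 0 keeps the
    axis linear; alpha > 0 stretches the low-frequency region."""
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))

def warped_center_frequencies(num_filters, alpha):
    """Filter-bank center frequencies (radians) that are equally spaced
    on the warped axis. The inverse of the bilinear warp is the same
    mapping with -alpha, so equal warped steps are mapped back with it."""
    step = math.pi / (num_filters + 1)
    return [bilinear_warp(step * (i + 1), -alpha) for i in range(num_filters)]

def cepstral_time_matrix(cepstra, num_time_coeffs):
    """Assumed CTM: unnormalized DCT-II over time for each cepstral index.
    cepstra is a list of M frames, each a list of N coefficients.
    Returns N rows with num_time_coeffs temporal coefficients each."""
    num_frames = len(cepstra)
    num_ceps = len(cepstra[0])
    ctm = []
    for n in range(num_ceps):
        row = []
        for k in range(num_time_coeffs):
            row.append(sum(
                cepstra[m][n] * math.cos(math.pi * k * (m + 0.5) / num_frames)
                for m in range(num_frames)))
        ctm.append(row)
    return ctm

if __name__ == "__main__":
    linear = warped_center_frequencies(5, 0.0)   # LFCC-like spacing
    warped = warped_center_frequencies(5, 0.4)   # hypothetical warping factor
    print([round(w, 3) for w in linear])
    print([round(w, 3) for w in warped])
    block = [[2.0, 1.0]] * 4                     # toy block: 4 frames, 2 cepstra
    print(cepstral_time_matrix(block, 2))
```

With `alpha = 0` the centers are uniformly spaced, which corresponds to the linear scale; a positive `alpha` moves the centers toward low frequencies, in the direction of a mel-like scale. In the CTM sketch, the zeroth temporal coefficient is the sum of each cepstral trajectory over the block and higher coefficients describe how the trajectory changes over time, which is the role the delta and delta-delta features otherwise play.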


