http://dx.doi.org/10.5391/JKIIS.2009.19.6.773

Robust Speech Recognition using Vocal Tract Normalization for Emotional Variation  

Kim, Weon-Goo (Dept. of Electrical Engineering, Kunsan National University)
Bang, Hyun-Jin (Dept. of Computer and Information Engineering, Kunsan National University)
Publication Information
Journal of the Korean Institute of Intelligent Systems / v.19, no.6, 2009, pp. 773-778
Abstract
This paper investigates training methods that make a speech recognition system robust to emotional variation. To this end, the effect of emotion on the speech signal was studied using a speech database containing various emotions. A recognizer trained on emotion-free speech degrades when the test speech carries emotion, because of the emotional mismatch between test and training data. We observed that emotional variation effectively changes the speaker's vocal tract length, and that this change is one cause of the degraded recognition performance. We therefore apply vocal tract normalization to build a recognition system that is robust to emotional variation. In HMM-based isolated-word recognition experiments, vocal tract normalization reduced the error rate of the conventional recognition system by 41.9% on emotional test data.
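The abstract does not detail the paper's exact normalization procedure. In standard vocal tract length normalization (VTLN), the frequency axis is warped by a speaker- or utterance-specific factor alpha, typically chosen by a maximum-likelihood grid search over a small range; a piecewise-linear warp is commonly used so the warped axis still spans the full band. The sketch below illustrates that warping function only (the function name, breakpoint fraction, and alpha range are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def piecewise_linear_warp(freqs, alpha, f_nyquist, f_break_frac=0.875):
    """Warp frequencies by factor alpha (VTLN-style).

    Below the breakpoint the warp is simply f -> alpha * f; above it,
    a second linear segment maps the remaining band so that the
    Nyquist frequency stays fixed. The breakpoint fraction 0.875 is a
    common illustrative choice, not a value from the paper.
    """
    bp = f_break_frac * f_nyquist * min(1.0, 1.0 / alpha)
    return np.where(
        freqs <= bp,
        alpha * freqs,
        alpha * bp + (f_nyquist - alpha * bp) / (f_nyquist - bp) * (freqs - bp),
    )

# In a full VTLN system, alpha would be selected per utterance by
# maximizing the HMM likelihood over a grid such as:
alphas = np.arange(0.88, 1.13, 0.02)
```

In practice this warp is applied to the mel filterbank center frequencies before computing MFCCs, so emotional (or inter-speaker) vocal-tract-length differences are compensated in the front end rather than in the models.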
Keywords
MFCC;
Citations & Related Records
  • Reference
1 J. C. Junqua and J. P. Haton, Robustness in Automatic Speech Recognition: Fundamentals and Applications, Kluwer Academic Publishers, 1996
2 A. Acero and R. M. Stern, 'Environmental robustness in automatic speech recognition,' in Proceedings of ICASSP, pp. 849-852, April 1990
3 H. Hermansky, N. Morgan, H. G. Hirsch, 'Recognition of speech in additive and convolutional noise based on RASTA spectral processing', in Proceedings of ICASSP, pp. 83-86, 1993
4 J. Koehler, N. Morgan, H. Hermansky, H. G. Hirsch, G. Tong, 'Integrating RASTA-PLP into Speech Recognition', in Proc. ICASSP, pp. 421-424, 1994
5 K. R. Scherer, D. R. Ladd, and K. E. A. Silverman, 'Vocal cues to speaker affect: testing two models', Journal of the Acoustical Society of America, Vol. 76, No. 5, pp. 1346-1355, Nov. 1984
6 J. Sato, and S. Morishima, 'Emotion modeling in speech production using emotion space', in Proceedings of the IEEE International Workshop 1996, pp. 472-477, Piscataway, NJ, USA., 1996
7 M. Pitz, H. Ney, 'Vocal tract normalization equals linear transformation in cepstral space', IEEE Trans. Speech & Audio Processing, vol. 13, No. 5, pp. 930-944, 2005
8 강봉석, 'Text-independent emotion recognition system using speech signals', M.S. thesis, Yonsei University, 2000
9 I. R. Murray and J. L. Arnott, 'Toward the simulation of emotion in synthetic speech: a review of the literature on human vocal emotion', Journal of the Acoustical Society of America, pp. 1097-1108, Feb. 1993
10 F. Dellaert, T. Polzin, A. Waibel, 'Recognizing emotion in speech', in Proceedings of the ICSLP 96, Philadelphia, USA, Oct. 1996
11 S. Wegmann, D. McAllaster, J. Orloff and B. Peskin, 'Speaker normalization on conversational telephone speech', in Proceedings of ICASSP, Atlanta, GA, pp. 339-342, May 1996
12 M. Lewis and J. M. Haviland, Handbook of Emotions, The Guilford Press, 1993
13 J. Vroomen, R. Collier and S. Mozziconacci, 'Duration and intonation in emotional speech', in Proceedings of Eurospeech '93, Vol. 1, pp. 577-580, Berlin, Germany, 1993
14 P. Alexandre, P. Lockwood, 'Root cepstral analysis: a unified view. Application to speech processing in car noise environments', Speech Communication, vol. 12, no. 3, pp. 277-288, 1993
15 B. Heuft, T. Portele, M. Rauth, 'Emotions in time domain synthesis', in Proceedings of ICSLP '96, Vol. 3, pp.1974-1977, Philadelphia, PA, USA, 1996
16 J. E. Cahn, 'The generation of affect in synthesized speech', Journal of the American Voice I/O Society, Vol. 8, pp. 1-19, July 1990
17 L. Welling, R. Haeb-Umbach, X. Aubert and N. Haberland, 'A study on speaker normalization using vocal tract normalization and speaker adaptive training', in Proceedings of ICASSP, Seattle, WA, pp. 797-800, May 1998
18 T. S. Huang, L. S. Chen and H. Tao, 'Bimodal emotion recognition by man and machine', in ATR Workshop on Virtual Communication Environments-Bridges over Art/Kansei and VR Technologies, Kyoto, Japan, 1998
19 N. Amir, 'Classifying emotions in speech: a comparison of methods', in Proceedings of Eurospeech '2001, Vol. 1, pp. 127-130, Aalborg, Denmark, 2001
20 E. Eide and H. Gish, 'A parametric approach to vocal tract length normalization', in Proceedings of ICASSP, Atlanta, GA, pp.346-349, May 1996
21 M. G. Rahim, B. H. Juang, 'Signal bias removal by maximum likelihood estimation for robust telephone speech recognition', IEEE Trans. Speech & Audio Processing, vol. 4, No. 1, pp. 19-30, 1996
22 A. Acero and R. M. Stern, 'Robust speech recognition by normalization of the acoustic space', in Proceedings of ICASSP, Toronto, pp. 893-896, May 1991
23 A. Nogueiras et al., 'Speech emotion recognition using Hidden Markov Models', in Proceedings of Eurospeech '2001, Vol. 4, pp. 2679-2682, Aalborg, Denmark, 2001
24 R. W. Picard, Affective Computing, The MIT Press, 1997
25 C. E. Williams and K. N. Stevens, 'Emotions and speech: some acoustical correlates', Journal of the Acoustical Society of America, Vol. 52, No. 4, pp. 1238-1250, 1972
26 S. Molau, S. Kanthak, H. Ney, 'Efficient vocal tract normalization in automatic speech recognition', in Proceedings of the ESSV'00, Cottbus, Germany, pp. 209-216, 2000
27 T. S. Polzin and A. H. Waibel, 'Detecting emotions in speech', Proceedings of the CMC (Cooperative Multimodal Communication), 1998
28 H. Hermansky, N. Morgan, A. Bayya, P. Kohn, 'Compensation for the effect of the communication channel in auditory-like analysis of speech (RASTA-PLP)', in Proceedings of EUROSPEECH, vol. 3, pp. 1367-1370, 1991