http://dx.doi.org/10.13064/KSSS.2014.6.1.077

Automatic Clustering of Speech Data Using Modified MAP Adaptation Technique  

Ban, Sung Min (Pusan National University)
Kang, Byung Ok (Electronics and Telecommunications Research Institute)
Kim, Hyung Soon (Pusan National University)
Publication Information
Phonetics and Speech Sciences, v.6, no.1, 2014, pp. 77-83
Abstract
This paper proposes a speaker and environment clustering method to mitigate the degradation in speech recognition performance caused by varied noise conditions and speaker characteristics. Instead of using the distance between Gaussian mixture model (GMM) weight vectors, as in Google's approach, the proposed method uses the distance between mean vectors adapted by a modified maximum a posteriori (MAP) adaptation as the distance measure for vector quantization (VQ) clustering. In experiments on simulated data generated by adding noise to clean speech, the proposed clustering method reduces the error rate by 10.6% compared with the baseline speaker-independent (SI) model, which is slightly better than Google's approach.
Keywords
speech recognition; speech data clustering; KL divergence; MAP adaptation
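
The clustering idea summarized in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: the textbook relevance-factor MAP update for GMM means stands in for the authors' modified MAP adaptation (whose details are not given here), plain squared Euclidean distance between stacked adapted means stands in for the paper's distance measure, and a simple k-means loop with farthest-first initialization stands in for the VQ clustering. The names `map_adapt_means`, `vq_cluster`, `simulate_utterance`, and the `relevance` parameter are hypothetical and chosen only for illustration.

```python
import math
import random


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at frame x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Standard relevance-MAP mean adaptation (assumption: NOT the paper's
    modified variant). Returns the adapted means stacked into one vector."""
    num_comp, dim = len(means), len(means[0])
    counts = [0.0] * num_comp                      # soft frame counts n_k
    sums = [[0.0] * dim for _ in range(num_comp)]  # posterior-weighted sums
    for x in frames:
        logp = [math.log(weights[k]) + log_gauss_diag(x, means[k], variances[k])
                for k in range(num_comp)]
        top = max(logp)                            # subtract max for stability
        post = [math.exp(lp - top) for lp in logp]
        norm = sum(post)
        for k in range(num_comp):
            gamma = post[k] / norm
            counts[k] += gamma
            for d in range(dim):
                sums[k][d] += gamma * x[d]
    # (sum + r * mu) / (n + r) equals alpha * E[x] + (1 - alpha) * mu
    # with alpha = n / (n + r).
    return [(sums[k][d] + relevance * means[k][d]) / (counts[k] + relevance)
            for k in range(num_comp) for d in range(dim)]


def sq_dist(a, b):
    """Squared Euclidean distance (an assumed stand-in distance measure)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def vq_cluster(vectors, num_clusters, iters=10):
    """k-means over adapted-mean vectors with farthest-first initialization."""
    centroids = [list(vectors[0])]
    while len(centroids) < num_clusters:
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(far))
    labels = []
    for _ in range(iters):
        labels = [min(range(num_clusters), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        for c in range(num_clusters):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def simulate_utterance(rng, shift, num_frames=200):
    """Toy 2-D frames: the second coordinate is shifted per 'environment'."""
    return [[rng.gauss(rng.choice([-2.0, 2.0]), 1.0), rng.gauss(shift, 1.0)]
            for _ in range(num_frames)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical two-component universal background model.
    ubm_weights = [0.5, 0.5]
    ubm_means = [[-2.0, 0.0], [2.0, 0.0]]
    ubm_vars = [[1.0, 1.0], [1.0, 1.0]]
    shifts = [1.5, 1.5, 1.5, -1.5, -1.5, -1.5]  # two simulated environments
    vectors = [map_adapt_means(simulate_utterance(rng, s),
                               ubm_weights, ubm_means, ubm_vars)
               for s in shifts]
    labels, _ = vq_cluster(vectors, num_clusters=2)
    print(labels)
```

In this toy setup each utterance is reduced to one vector of adapted means, so utterances recorded under similar conditions land close together and can be grouped before cluster-specific acoustic models are trained. A weight-vector distance, as attributed to Google's approach in the abstract, would replace the stacked means with the adapted mixture weights.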