A Study on the Optimization of State Tying Acoustic Models using Mixture Gaussian Clustering

혼합 가우시안 군집화를 이용한 상태공유 음향모델 최적화

  • 안태옥 (School of Computer Science, Howon University)
  • Published: 2005.11.01

Abstract

This paper describes a method for optimizing the decision-tree-based state tying model, one of the acoustic models used for speech recognition, by reducing the number of mixture Gaussians in the output probability distribution. State tying modeling uses a finite set of questions, which can incorporate phonological knowledge, together with a likelihood-based decision criterion. The recognition rate can be improved by increasing the number of mixture Gaussians in the output probability distribution. In this paper, we reduce the number of mixture Gaussians at the point of highest recognition rate by clustering the Gaussians. The Bhattacharyya and Euclidean distances are used as the distance measures for clustering. A new Gaussian is created by computing the mean and variance from the pair of Gaussians with the smallest distance, so its parameters are derived from those of the two Gaussians it replaces. Experiments were performed on the STOCKNAME database (1,680 words). The results show that the proposed method with the Bhattacharyya distance maintains a recognition rate of $97.2\%$ while reducing the ratio of the number of mixture Gaussians by $1.0\%$, and that the method with the Euclidean distance maintains a recognition rate of $96.9\%$ while reducing the ratio of the number of mixture Gaussians by $1.0\%$. Both methods can therefore be used to optimize the state tying model.

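This paper proposes a method for optimizing the model by reducing the number of mixture Gaussians in the output probability distributions, based on decision tree state tying (DTST) modeling, one of the acoustic modeling methods used in speech recognition. DTST uses a set of questions that can incorporate phonetic knowledge and a likelihood-based decision method. The recognition rate can then be increased by increasing the number of mixture Gaussians in the output probability distributions of the states. In this paper, we cluster the mixture Gaussians at the point where the recognition rate is highest in order to reduce their number. The Euclidean and Bhattacharyya methods were used as the distance measures for clustering, and each new Gaussian was generated by recomputing the mean and variance from the two Gaussians with the smallest distance. In experiments on a word database of 1,680 listed company names (STOCKNAME), the Bhattacharyya method maintained a recognition rate of $97.2\%$ while reducing the ratio of the total number of mixture Gaussians to $1.0\%$, and the Euclidean method maintained a recognition rate of $96.9\%$ while reducing the ratio of the number of mixture Gaussians to $1.0\%$, thereby optimizing the model.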
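
The clustering step described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: it assumes diagonal-covariance Gaussians, applies the Euclidean distance to the mean vectors only, and merges a pair by weighted moment matching, whereas the abstract says only that the mean and variance are recomputed from the closest pair. The function names (`euclidean_distance`, `bhattacharyya_distance`, `merge`, `cluster_mixture`) and the toy mixture are hypothetical.

```python
import math
from itertools import combinations

# A mixture component is (weight, mean vector, diagonal variance vector).
# Diagonal covariances are an assumption of this sketch.


def euclidean_distance(g1, g2):
    """Euclidean distance between the two mean vectors (variances ignored)."""
    _, m1, _ = g1
    _, m2, _ = g2
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(m1, m2)))


def bhattacharyya_distance(g1, g2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    _, m1, v1 = g1
    _, m2, v2 = g2
    dist = 0.0
    for a, b, va, vb in zip(m1, m2, v1, v2):
        avg = 0.5 * (va + vb)                             # averaged variance
        dist += 0.125 * (a - b) ** 2 / avg                # mean separation term
        dist += 0.5 * math.log(avg / math.sqrt(va * vb))  # variance mismatch term
    return dist


def merge(g1, g2):
    """Replace two Gaussians by one via weighted moment matching (assumed rule)."""
    w1, m1, v1 = g1
    w2, m2, v2 = g2
    w = w1 + w2
    mean = [(w1 * a + w2 * b) / w for a, b in zip(m1, m2)]
    var = [
        (w1 * (va + a * a) + w2 * (vb + b * b)) / w - m * m
        for a, b, va, vb, m in zip(m1, m2, v1, v2, mean)
    ]
    return (w, mean, var)


def cluster_mixture(components, target_count, distance):
    """Greedily merge the closest pair until target_count components remain."""
    comps = list(components)
    while len(comps) > max(target_count, 1):
        i, j = min(
            combinations(range(len(comps)), 2),
            key=lambda p: distance(comps[p[0]], comps[p[1]]),
        )
        merged = merge(comps[i], comps[j])
        comps = [c for k, c in enumerate(comps) if k not in (i, j)]
        comps.append(merged)
    return comps


# Toy three-component mixture for one tied state (hypothetical values).
mixture = [
    (0.4, [0.0, 0.0], [1.0, 1.0]),
    (0.4, [0.2, 0.0], [1.0, 1.0]),
    (0.2, [5.0, 5.0], [2.0, 2.0]),
]
reduced = cluster_mixture(mixture, 2, bhattacharyya_distance)
```

In this toy mixture the first two components are nearly identical, so they are merged first and the distant third component is left untouched. Passing `euclidean_distance` instead of `bhattacharyya_distance` selects the other distance measure named in the abstract.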
