The Effect of the Number of Phoneme Clusters on Speech Recognition

  • Received: 2014.08.11
  • Accepted: 2014.11.10
  • Published: 2014.11.30

Abstract

In an effort to improve the efficiency of speech recognition, we investigate the effect of the number of phoneme clusters. For this purpose, codebooks with varying numbers of phoneme clusters are prepared using a modified k-means clustering algorithm. These codebooks are then used in speech recognition tests based on fuzzy vector quantization (FVQ) and hidden Markov models (HMMs). The results show two distinct regimes. When the number of phoneme clusters is large, recognition performance is roughly independent of it. When the number is small, however, the recognition error rate increases nonlinearly as the number decreases. Numerical calculations indicate that this nonlinear regime may be modeled by a power-law function. The results also show that about 166 phoneme clusters would be the optimal number for recognizing 300 isolated words, which corresponds to roughly 3 variations per phoneme.
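The codebook-generation step can be sketched as follows. This is a minimal sketch, assuming the training data are feature vectors (for example, one vector per speech frame) given as lists of floats. The abstract does not state how the k-means algorithm was modified; purely as an assumption, the sketch reseeds any cluster that becomes empty with the training vector farthest from its assigned centroid, so every codebook keeps exactly `k` entries. The names `sq_dist` and `kmeans_codebook` are hypothetical.

```python
import random
from typing import List, Sequence, Set

Vector = List[float]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_codebook(data: List[Vector], k: int, iters: int = 50, seed: int = 0) -> List[Vector]:
    """Build a k-entry VQ codebook with k-means.

    Assumed modification (not taken from the paper): a cluster that becomes
    empty is reseeded with the training vector farthest from its assigned
    centroid, so the codebook always has exactly k entries.
    """
    if not 0 < k <= len(data):
        raise ValueError("k must be between 1 and the number of training vectors")
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(data, k)]
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for every vector.
        labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j])) for v in data]
        new_centroids: List[Vector] = []
        taken: Set[int] = set()  # vectors already used for reseeding in this pass
        for j in range(k):
            members = [v for v, lab in zip(data, labels) if lab == j]
            if members:
                # Update step: the centroid moves to the mean of its members.
                dim = len(members[0])
                new_centroids.append([sum(v[d] for v in members) / len(members) for d in range(dim)])
            else:
                # Empty cluster: reseed with the worst-quantized unused vector.
                far = max((i for i in range(len(data)) if i not in taken),
                          key=lambda i: sq_dist(data[i], centroids[labels[i]]))
                taken.add(far)
                new_centroids.append(list(data[far]))
        if new_centroids == centroids:  # assignments have stabilized
            break
        centroids = new_centroids
    return centroids


if __name__ == "__main__":
    frames = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
    # Two centroids, near (1/3, 1/3) and (31/3, 31/3).
    print(sorted(kmeans_codebook(frames, k=2)))
```

Calling `kmeans_codebook` repeatedly with different values of `k` gives the kind of sweep over the number of phoneme clusters that the study reports.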
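The recognition step can be sketched by coupling fuzzy vector quantization to discrete HMMs. In this minimal sketch, each frame receives a fuzzy-c-means-style membership in every codeword (the fuzziness exponent `m = 2` is an assumption), and the observation probability of a frame in state `i` is taken as the membership-weighted sum of that state's codeword probabilities. This is one common formulation and an assumption here, not necessarily the one used in this study. Likelihoods come from the scaled forward algorithm, and the word model with the highest log-likelihood is chosen. All names (`fuzzy_memberships`, `fuzzy_log_likelihood`, `recognize`) are hypothetical, and HMM training is omitted.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
# (pi, A, B): initial probs, transition matrix, per-state codeword probs.
HMM = Tuple[List[float], List[List[float]], List[List[float]]]


def fuzzy_memberships(x: Vector, codebook: List[Vector], m: float = 2.0) -> List[float]:
    """Membership of frame x in each codeword (fuzzy c-means form, m > 1, sums to 1)."""
    dists = [math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c))) for c in codebook]
    for k, d in enumerate(dists):
        if d == 0.0:  # x coincides with a codeword: fall back to a crisp label
            return [1.0 if j == k else 0.0 for j in range(len(codebook))]
    p = 2.0 / (m - 1.0)
    return [1.0 / sum((dk / dj) ** p for dj in dists) for dk in dists]


def fuzzy_log_likelihood(frames: List[Vector], codebook: List[Vector], model: HMM, m: float = 2.0) -> float:
    """Scaled forward algorithm; observation prob = sum_k u_k * B[i][k] (assumed)."""
    pi, A, B = model
    n = len(pi)
    log_p = 0.0
    alpha: List[float] = []
    for t, x in enumerate(frames):
        u = fuzzy_memberships(x, codebook, m)
        b = [sum(uk * bk for uk, bk in zip(u, B[i])) for i in range(n)]
        if t == 0:
            alpha = [pi[i] * b[i] for i in range(n)]
        else:
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * b[j] for j in range(n)]
        scale = sum(alpha)  # P(x_t | x_1..x_{t-1}); assumed to be > 0
        log_p += math.log(scale)
        alpha = [a / scale for a in alpha]  # normalize to avoid underflow
    return log_p


def recognize(frames: List[Vector], codebook: List[Vector], models: Dict[str, HMM]) -> str:
    """Return the word whose HMM assigns the frames the highest log-likelihood."""
    return max(models, key=lambda w: fuzzy_log_likelihood(frames, codebook, models[w]))


if __name__ == "__main__":
    codebook = [[0.0], [2.0]]
    models = {
        "yes": ([1.0], [[1.0]], [[0.9, 0.1]]),
        "no": ([1.0], [[1.0]], [[0.1, 0.9]]),
    }
    print(recognize([[0.0], [0.5]], codebook, models))  # prints: yes
```

Because a frame contributes to several codewords at once, a smaller codebook blurs the distinction between phoneme variants less abruptly than crisp VQ would, which is one reason FVQ is often paired with discrete HMMs.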
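The power-law description of the small-cluster regime can be illustrated with a least-squares fit in log-log space. This minimal sketch assumes the form `E(N) = a * N**(-b)`, where `N` is the number of phoneme clusters and `E` the recognition error rate. The abstract does not give the exact functional form, so both the form (for instance, the absence of an additive error floor) and the name `fit_power_law` are assumptions. Only points from the nonlinear, small-`N` regime should be passed in, since for large `N` the error rate is roughly flat. The values in the usage lines are synthetic, generated from a known power law, and are not results from this study.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(n_clusters: Sequence[float], error_rates: Sequence[float]) -> Tuple[float, float]:
    """Fit E(N) = a * N**(-b) by ordinary least squares on log E = log a - b * log N.

    Returns (a, b). Assumes all inputs are positive and at least two distinct N values.
    """
    xs = [math.log(n) for n in n_clusters]
    ys = [math.log(e) for e in error_rates]
    count = len(xs)
    mean_x, mean_y = sum(xs) / count, sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # slope of log E versus log N equals -b
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope


if __name__ == "__main__":
    # Synthetic check: data generated from E = 50 * N**-0.8 (not measured data).
    ns = [10, 20, 40, 80]
    errs = [50 * n ** -0.8 for n in ns]
    a, b = fit_power_law(ns, errs)
    print(round(a, 6), round(b, 6))  # prints: 50.0 0.8
```

A log-log linear fit is the simplest choice; if an error floor were present, a nonlinear fit of `a * N**(-b) + c` would be needed instead.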
