Distance Measures in HMM Clustering for Large-scale On-line Chinese Character Recognition

Clustering Distance Measures for Large-Scale On-line Chinese Character Recognition (translated from the Korean title)

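  • Kwang-Seob Kim (김광섭), Department of Computer and Information Communication Engineering, Kangwon National University;
  • Jin-Young Ha (하진영), Division of Computer Science, Kangwon National University
  • Published: 2009.09.15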

Abstract

A major obstacle to building a good HMM-based (Hidden Markov Model) system for large-scale on-line Chinese character recognition is that recognition time grows with the number of character classes. In this paper, we propose clustering the HMMs to improve recognition speed, together with an efficient distance measure between HMMs suited to this clustering. In experiments building a recognition system for all 20,902 Chinese characters defined in the Unicode CJK Unified Ideographs, recognition became about twice as fast, and the 10-candidate recognition accuracy reached 95.37%, only 0.9% lower than without clustering.
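The abstract does not spell out the distance measure or the clustering procedure, so the following is only a minimal sketch of the general idea under stated assumptions. It uses discrete HMMs given as `(pi, A, B)` tuples, a symmetrised Monte Carlo log-likelihood-ratio distance (a KL-style measure, not necessarily the one proposed in the paper), and a simple greedy "leader" clustering. All function names, the threshold, and the toy models are hypothetical. Recognition then scores only the cluster representatives first and evaluates the members of the best cluster(s) afterwards, which is where the speed-up would come from.

```python
import math
import random

def log_likelihood(model, obs):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM."""
    pi, A, B = model
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                     for j in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        log_p += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_p

def sample(model, length, rng):
    """Draw an observation sequence of the given length from the model."""
    pi, A, B = model
    state = rng.choices(range(len(pi)), weights=pi)[0]
    obs = []
    for _ in range(length):
        obs.append(rng.choices(range(len(B[state])), weights=B[state])[0])
        state = rng.choices(range(len(pi)), weights=A[state])[0]
    return obs

def hmm_distance(m1, m2, length=200, seed=0):
    """Assumed measure: symmetrised Monte Carlo log-likelihood-ratio distance."""
    rng = random.Random(seed)
    def directed(a, b):
        obs = sample(a, length, rng)
        return (log_likelihood(a, obs) - log_likelihood(b, obs)) / length
    return 0.5 * (directed(m1, m2) + directed(m2, m1))

def cluster_models(models, threshold):
    """Greedy leader clustering: join the first representative within threshold."""
    clusters, reps = [], []
    for label, model in models.items():
        for idx, rep in enumerate(reps):
            if hmm_distance(models[rep], model) < threshold:
                clusters[idx].append(label)
                break
        else:
            reps.append(label)
            clusters.append([label])
    return clusters, reps

def recognize(obs, models, clusters, reps, top_clusters=1, n_best=10):
    """Two-stage search: rank representatives, then score only their members."""
    ranked = sorted(range(len(clusters)),
                    key=lambda c: log_likelihood(models[reps[c]], obs),
                    reverse=True)
    candidates = [lab for c in ranked[:top_clusters] for lab in clusters[c]]
    candidates.sort(key=lambda lab: log_likelihood(models[lab], obs), reverse=True)
    return candidates[:n_best]

# Toy two-state, two-symbol models standing in for character HMMs.
A_TRANS = [[0.9, 0.1], [0.1, 0.9]]
MODELS = {
    "A":  ([1.0, 0.0], A_TRANS, [[0.90, 0.10], [0.8, 0.2]]),
    "A2": ([1.0, 0.0], A_TRANS, [[0.85, 0.15], [0.8, 0.2]]),
    "B":  ([1.0, 0.0], A_TRANS, [[0.10, 0.90], [0.2, 0.8]]),
    "B2": ([1.0, 0.0], A_TRANS, [[0.15, 0.85], [0.2, 0.8]]),
}
CLUSTERS, REPS = cluster_models(MODELS, threshold=0.3)
print(CLUSTERS, recognize([0] * 20, MODELS, CLUSTERS, REPS))
```

With `top_clusters=1`, only the representatives and the members of one cluster are scored instead of every model. Raising `top_clusters` trades speed for a lower risk of missing the correct class, which mirrors the speed versus accuracy trade-off reported in the abstract.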
