Word Segmentation in Handwritten Korean Text Lines based on GAP Clustering

GAP 군집화에 기반한 필기 한글 단어 분리

  • Published : 2000.06.15

Abstract

This paper proposes a word segmentation method for handwritten Korean text line images. The method segments words using gap information, where a gap is a white run in the vertical projection of a line image. Each gap is classified as either an inter-word gap or an inter-character gap according to its gap distance. We adopt three distance measures previously proposed for word segmentation of handwritten English text lines and test them with three clustering techniques to identify the best combination of gap metric and classification technique for Korean text lines. Experiments were conducted on 305 text line images extracted manually from address lines on live mail pieces. The results show that the BB (bounding box) distance measure combined with the sequential clustering approach performs best, with a cumulative word segmentation accuracy of 88.52% up to the third hypothesis. Processing takes about 0.05 seconds per line image.
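
The gap extraction step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `find_gaps` and `split_gaps` are hypothetical, the white-run width stands in for the gap distance (the paper compares three distance measures, including the BB distance), and the two-way split cuts at the largest jump in sorted gap widths, a simple substitute for the sequential clustering the paper evaluates.

```python
from typing import List, Tuple

Gap = Tuple[int, int]  # (start column, width in pixels)


def find_gaps(image: List[List[int]]) -> List[Gap]:
    """Return interior white runs of the vertical projection profile.

    `image` is a binary line image given as rows of pixels:
    1 = ink, 0 = background (an assumed encoding).
    """
    if not image:
        return []
    width = len(image[0])
    # Vertical projection: number of ink pixels in each column.
    profile = [sum(row[x] for row in image) for x in range(width)]

    gaps: List[Gap] = []
    run_start = None
    for x, count in enumerate(profile):
        if count == 0 and run_start is None:
            run_start = x  # a white run begins here
        elif count > 0 and run_start is not None:
            # A run starting at column 0 is the left margin, not a gap.
            if run_start > 0:
                gaps.append((run_start, x - run_start))
            run_start = None
    # A run still open at the end is the right margin and is ignored.
    return gaps


def split_gaps(gaps: List[Gap]) -> List[bool]:
    """Label each gap True (inter-word) or False (inter-character).

    Stand-in rule (assumption): sort the widths and place the
    threshold in the middle of the largest jump between neighbours.
    """
    if len(gaps) < 2:
        return [False] * len(gaps)
    widths = sorted(w for _, w in gaps)
    jumps = [(widths[i + 1] - widths[i], i) for i in range(len(widths) - 1)]
    best_jump, i = max(jumps)
    if best_jump == 0:
        # All gaps have the same width: no evidence of a word boundary.
        return [False] * len(gaps)
    threshold = (widths[i] + widths[i + 1]) / 2
    return [w > threshold for _, w in gaps]


# Toy one-row "line image": three ink blobs, then a wide gap, then two more.
toy_line = [[0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0]]
toy_gaps = find_gaps(toy_line)
toy_labels = split_gaps(toy_gaps)
```

On the toy line, the wide run of four background columns is the only gap labeled as inter-word. A system that reports several ranked hypotheses, as the abstract's "up to the third hypothesis" figure suggests, would presumably keep more than one candidate split rather than the single cut shown here.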
