A Feature-Based Word Spotting for Content-Based Retrieval of Machine-Printed English Document Images

내용기반의 인쇄체 영문 문서 영상 검색을 위한 특징 기반 단어 검색

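  • 정규식 (School of Information and Communication Engineering, Soongsil University)
  • 권희웅 (Department of Electronic Engineering, Soongsil University)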
  • Published : 1999.10.01

Abstract

Most existing digital libraries for document image retrieval offer only limited search because their indexes are built from document titles and/or abstracts. This paper proposes a word spotting system, based on word image shape features, for retrieval over the full content of English document images. To improve both the efficiency and the precision of retrieval, the system 1) combines the holistic features used in existing word spotting systems, 2) matches word images by comparing the order of features within a word in addition to the number of features and their positions, and 3) adopts a two-stage retrieval strategy in which candidates are first obtained by image feature matching and OCR (Optical Character Recognition) is then applied partially to those candidates as a filter.

The proposed system operates as follows. Given a document image, its structure is analyzed and the image is segmented into a set of word regions. Word shape features are then extracted and stored. Given a user's text query, the corresponding word image is generated and its features are extracted. These reference features are compared with the stored features to find similar words. The system was implemented on an IBM PC in a web environment and tested on English document images. Experimental results show the effectiveness of the proposed methods.
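The two-stage strategy described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the feature set, so the sketch assumes a simple ascender/descender/x-height shape code per character, uses `difflib.SequenceMatcher` as a stand-in for order-aware feature matching, and replaces the OCR filter with a stub that compares against stored text. The names `shape_code`, `build_index`, `spot_word`, and `ocr_stub` are hypothetical.

```python
from difflib import SequenceMatcher

# Assumed coarse shape classes; the paper's actual features are not given here.
ASCENDERS = set("bdfhklt") | set("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")
DESCENDERS = set("gjpqy")


def shape_code(word):
    """Map each character to A (ascender), D (descender), or x (x-height)."""
    codes = []
    for ch in word:
        if ch in DESCENDERS:
            codes.append("D")
        elif ch in ASCENDERS:
            codes.append("A")
        else:
            codes.append("x")
    return "".join(codes)


def build_index(words):
    """Simulate stored word features. A real system would extract them
    from segmented word images rather than from known text."""
    return [{"text": w, "features": shape_code(w)} for w in words]


def feature_similarity(a, b):
    """Order-sensitive similarity between two feature sequences (0.0 to 1.0)."""
    return SequenceMatcher(None, a, b).ratio()


def ocr_stub(query, entry):
    """Stand-in for the partial OCR filter: compares against stored text."""
    return entry["text"] == query


def spot_word(query, index, threshold=0.8, verify=None):
    """Stage 1: keep entries whose features resemble the query's features.
    Stage 2 (optional): filter the candidates with a verification function."""
    reference = shape_code(query)
    candidates = [
        entry for entry in index
        if feature_similarity(reference, entry["features"]) >= threshold
    ]
    if verify is None:
        return candidates
    return [entry for entry in candidates if verify(query, entry)]


if __name__ == "__main__":
    index = build_index(["image", "range", "document", "word"])
    print([e["text"] for e in spot_word("image", index)])
    print([e["text"] for e in spot_word("image", index, verify=ocr_stub)])
```

In this toy index, "image" and "range" share the same shape code, so feature matching alone returns both; the second stage is what separates them, which is the role the abstract assigns to the partial OCR step.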
