Word Extraction from Table Regions in Document Images

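  • Chang Bu Jeong (Department of Internet Software, College of Information and Communication, Honam University) ;
  • Soo Hyung Kim (School of Electronics, Computer and Information Engineering, College of Engineering, Chonnam National University)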
  • Published : 2005.08.01

Abstract

Document layout analysis segments a document image and classifies its regions as text, picture, or table. Words in table regions matter for applications such as keyword spotting because they tend to carry more meaning than words in other regions. This paper proposes a method for extracting words from table regions in document images. Because extracting words from a table amounts in practice to extracting words from the cells that compose it, the cells must be extracted correctly. The cell extraction module first extracts the table frame by analyzing connected components, and then detects intersection points on the table frame only, rather than on the whole image. False intersections are corrected using their relationship to neighboring intersections, and the cells are extracted from the final intersection information. The text region in each cell is located by reusing the connected-component information for character components obtained during cell extraction, and is segmented into text lines by analyzing projection profiles. Finally, each text line is divided into words using gap clustering and special-symbol detection. In experiments on table images extracted from Korean-language papers, the method achieved $99.16\%$ word extraction accuracy.
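The last two stages of the abstract, line segmentation by projection profile and word splitting by gap clustering, can be illustrated with a minimal sketch. This is not the authors' implementation: the binary-bitmap input, the function names (`runs`, `split_lines`, `split_glyphs`, `gap_threshold`, `split_words`), and the use of a one-dimensional two-means clustering are all assumptions made for illustration, and special-symbol detection is omitted.

```python
from typing import List, Tuple

Bitmap = List[List[int]]  # assumed input: 1 = ink pixel, 0 = background


def runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index ranges where the profile is non-zero."""
    found, start = [], None
    for i, count in enumerate(profile):
        if count > 0 and start is None:
            start = i
        elif count == 0 and start is not None:
            found.append((start, i - 1))
            start = None
    if start is not None:
        found.append((start, len(profile) - 1))
    return found


def split_lines(cell: Bitmap) -> List[Tuple[int, int]]:
    """Text lines are the non-empty runs of the horizontal projection profile."""
    return runs([sum(row) for row in cell])


def split_glyphs(cell: Bitmap, top: int, bottom: int) -> List[Tuple[int, int]]:
    """Inked column runs inside one text line (vertical projection profile)."""
    width = len(cell[0])
    profile = [sum(cell[y][x] for y in range(top, bottom + 1)) for x in range(width)]
    return runs(profile)


def gap_threshold(gaps: List[int], iterations: int = 20) -> float:
    """Two-means clustering of gap widths; returns the midpoint of the two centers."""
    small_c, large_c = float(min(gaps)), float(max(gaps))
    if small_c == large_c:
        return large_c  # assumption: identical gaps mean there is no word boundary
    for _ in range(iterations):
        mid = (small_c + large_c) / 2
        small = [g for g in gaps if g <= mid]
        large = [g for g in gaps if g > mid]
        small_c, large_c = sum(small) / len(small), sum(large) / len(large)
    return (small_c + large_c) / 2


def split_words(glyphs: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge glyph runs into words, cutting where a gap falls in the 'large' cluster."""
    if len(glyphs) < 2:
        return list(glyphs)
    gaps = [glyphs[i + 1][0] - glyphs[i][1] - 1 for i in range(len(glyphs) - 1)]
    threshold = gap_threshold(gaps)
    words = [list(glyphs[0])]
    for gap, (left, right) in zip(gaps, glyphs[1:]):
        if gap > threshold:
            words.append([left, right])  # wide gap: start a new word
        else:
            words[-1][1] = right         # narrow gap: extend the current word
    return [(left, right) for left, right in words]
```

Clustering the gaps separately for each line, as done here, is a design choice of the sketch; pooling gaps across a whole cell or table would be an equally plausible reading of "gap clustering".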
