http://dx.doi.org/10.3745/KIPSTB.2005.12B.4.369

Word Extraction from Table Regions in Document Images  

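Jeong, Chang-Bu (Department of Internet Software, College of Information and Communication, Honam University)
Kim, Soo-Hyung (School of Electronics, Computer and Information Communication Engineering, College of Engineering, Chonnam National University)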
Abstract
A document image is segmented and classified into text, picture, or table regions by document layout analysis, and the words in table regions are significant for keyword spotting because they are more meaningful than the words in other regions. This paper proposes a method to extract words from table regions in document images. Because extracting words from a table region amounts in practice to extracting words from the cell regions that compose the table, the cells must be extracted correctly. In the cell extraction module, the table frame is extracted first by analyzing connected components, and then the intersection points are extracted from the table frame. False intersections are corrected using the correlation between neighboring intersections, and the cells are extracted using the intersection information. Text regions in the individual cells are located using the connected-component information obtained during cell extraction, and they are segmented into text lines using projection profiles. Finally, the segmented lines are divided into words using gap clustering and special symbol detection. An experiment on table images extracted from Korean documents shows $99.16\%$ word extraction accuracy.
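The last two steps of the abstract (line segmentation by projection profiles, then word segmentation by gap clustering) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `ink_runs`, `split_lines`, and `split_words` are hypothetical, a cell image is assumed to be a binary grid of 0/1 pixels, and the clustering rule (threshold at the largest jump between sorted gap widths) is an assumed stand-in for whatever clustering the paper uses. Special symbol detection is not covered.

```python
from typing import List, Tuple

Run = Tuple[int, int]  # half-open index range [start, end)


def ink_runs(profile: List[int]) -> List[Run]:
    """Return maximal runs of consecutive non-zero entries of a projection profile."""
    runs: List[Run] = []
    start = None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def split_lines(image: List[List[int]]) -> List[Run]:
    """Horizontal projection profile: count ink per row, cut at empty rows."""
    return ink_runs([sum(row) for row in image])


def split_words(image: List[List[int]], line: Run) -> List[Run]:
    """Vertical profile of one text line, then two-class gap clustering."""
    top, bottom = line
    width = len(image[0])
    profile = [sum(image[r][c] for r in range(top, bottom)) for c in range(width)]
    runs = ink_runs(profile)
    if not runs:
        return []
    # Widths of the blank gaps between neighbouring ink runs.
    gaps = [b[0] - a[1] for a, b in zip(runs, runs[1:])]
    sizes = sorted(set(gaps))
    if len(sizes) < 2:
        # Assumption: with fewer than two distinct gap widths there is
        # nothing to cluster, so the whole line is treated as one word.
        return [(runs[0][0], runs[-1][1])]
    # Assumed clustering rule: put the threshold in the middle of the
    # largest jump between consecutive sorted gap widths.
    _, idx = max((sizes[i + 1] - sizes[i], i) for i in range(len(sizes) - 1))
    threshold = (sizes[idx] + sizes[idx + 1]) / 2
    words: List[Run] = []
    start = runs[0][0]
    for (a, b), gap in zip(zip(runs, runs[1:]), gaps):
        if gap > threshold:  # wide gap: word boundary
            words.append((start, a[1]))
            start = b[0]
    words.append((start, runs[-1][1]))
    return words
```

Calling `split_lines` on a cell image gives the row ranges of its text lines, and passing each range to `split_words` gives the column ranges of the words on that line. Narrow gaps are kept inside a word as inter-character spacing.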
Keywords
Document Image Retrieval; Document Image Preprocessing; OCR; Word Segmentation