Machine Learning Based Automatic Categorization Model for Text Lines in Invoice Documents  

Shin, Hyun-Kyung (Dept. of Mathematics & Information, Kyungwon University)
Abstract
Automatic understanding of the contents of a document image is a very hard problem because it involves mathematically challenging problems that originate mainly from the over-determined system induced by the document segmentation process. In both academia and industry, there have been continuing and varied efforts to improve the core content retrieval technologies by setting aside segmentation-related issues through the use of semi-structured documents, e.g., invoices. In this paper we propose classification models for text lines in invoice documents, in which text lines are assigned to five categories according to their contents: purchase order header, invoice header, summary header, surcharge header, and purchase items. Our investigation concentrates on the performance of machine learning based models from two groups: linear-discriminant-analysis (LDA) and non-LDA (logic based). In the LDA group we use naïve Bayes, k-nearest neighbor, and SVM; in the non-LDA group we use decision tree, random forest, and boosting. We describe the details of feature vector construction and of the model and parameter selection processes, including training and validation. We also present experimental results comparing the training and classification error levels of the models employed.
Keywords
Text classification; document image analysis; document image understanding; information retrieval; machine learning; CART (classification and regression tree); automatic invoice document processing