http://dx.doi.org/10.3745/KTSDE.2018.7.11.411

Automatic Object Extraction from Electronic Documents Using Deep Neural Network  

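Jang, Heejin (Science and Technology Research Network Center, Korea Institute of Science and Technology Information)
Chae, Yeonghun (Science and Technology Research Network Center, Korea Institute of Science and Technology Information)
Lee, Sangwon (Department of Chemical and Biomolecular Engineering, Korea Advanced Institute of Science and Technology)
Jo, Jinyong (Science and Technology Research Network Center, Korea Institute of Science and Technology Information)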
Publication Information
KIPS Transactions on Software and Data Engineering, v.7, no.11, 2018, pp. 411-418
Abstract
As artificial intelligence technology spreads, obtaining, storing, and using scientific data is becoming increasingly important in research and science. Many methods have been proposed for extracting meaningful objects, such as graphs and tables, from research articles as a step toward obtaining that data. Existing methods rely on heuristics designed for specific target manuscripts, so they are difficult to apply to electronic documents with heterogeneous formats. This paper proposes a prototype object extraction system that uses recent deep-learning technology to overcome the inflexibility of the heuristic approaches. We trained a model based on the Faster R-CNN algorithm using the Google TensorFlow Object Detection API, and built an annotated data set from 100 research articles for training and evaluation. A performance evaluation shows that the proposed system outperforms a heuristic-based comparator by 5.2%.
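The abstract does not say how extracted objects were scored against the annotated data set. As an illustration only, the sketch below shows one common way to evaluate a detector's output: match each predicted bounding box to an annotated box by intersection over union (IoU), then compute precision and recall. The function names, the box layout, and the 0.5 threshold are assumptions made for this example, not details taken from the paper.

```python
from typing import List, Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max) in page-image pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    overlap_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = overlap_w * overlap_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_detections(
    predicted: List[Box], annotated: List[Box], threshold: float = 0.5
) -> Tuple[float, float]:
    """Greedily match predictions to annotations; return (precision, recall).

    Each annotated box can be matched at most once. The threshold of 0.5 is
    an assumed, commonly used default, not a value from the paper.
    """
    unmatched = list(annotated)
    true_positives = 0
    for box in predicted:
        best = max(unmatched, key=lambda gt: iou(box, gt), default=None)
        if best is not None and iou(box, best) >= threshold:
            unmatched.remove(best)
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(annotated) if annotated else 0.0
    return precision, recall
```

With one annotated table region and two predictions, only one of which overlaps it, this sketch would report a precision of 0.5 and a recall of 1.0.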
Keywords
Object Extraction; Deep Learning; TensorFlow; PDF Document