Automatic Object Extraction from Electronic Documents Using Deep Neural Network

A Study on the Automatic Extraction of Objects from Electronic Documents Using Deep Neural Networks

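  • 장희진 (Science and Technology Research Network Center, Korea Institute of Science and Technology Information)
  • 채영훈 (Science and Technology Research Network Center, Korea Institute of Science and Technology Information)
  • 이상원 (Department of Chemical and Biomolecular Engineering, Korea Advanced Institute of Science and Technology)
  • 조진용 (Science and Technology Research Network Center, Korea Institute of Science and Technology Information)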
  • Received : 2018.03.27
  • Accepted : 2018.06.07
  • Published : 2018.11.30

Abstract

As artificial intelligence technology spreads, obtaining, storing, and using research data is becoming increasingly important in science and technology. To obtain such data, a number of methods have been proposed for extracting meaningful objects, such as graphs and tables, from research articles published as electronic documents. Existing methods rely on heuristics that generalize the layout characteristics of a few targeted manuscript formats, so they are difficult to apply to large collections of electronic documents with heterogeneous formats. This paper proposes a prototype object extraction system that uses deep learning to overcome the inflexibility of the heuristic approaches. We trained a model based on the Faster R-CNN algorithm using the TensorFlow Object Detection API, and we built an annotated data set from about 100 research articles for training and evaluation. A performance evaluation shows that the proposed system outperforms a comparison system based on heuristic approaches by about 5.2%.
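
The abstract reports the performance gain without spelling out how an extracted object is judged correct. As a minimal sketch, assuming the common object-detection convention that a predicted box counts as correct when its intersection over union (IoU) with a not-yet-matched ground-truth box is at least 0.5, precision and recall for one page could be computed as follows. The box format, the threshold, and all function names are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Tuple

# Assumed box format: (xmin, ymin, xmax, ymax) in page pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_detections(
    predicted: List[Box], ground_truth: List[Box], threshold: float = 0.5
) -> Tuple[int, int, int]:
    """Greedy one-to-one matching; returns (true pos., false pos., false neg.)."""
    unmatched = list(ground_truth)
    true_pos = 0
    for pred in predicted:
        # Pick the remaining ground-truth box that overlaps this prediction most.
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) >= threshold:
            unmatched.remove(best)
            true_pos += 1
    false_pos = len(predicted) - true_pos
    false_neg = len(unmatched)
    return true_pos, false_pos, false_neg


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision and recall, defined as 0.0 when the denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

Under these assumptions, two systems can be compared by running `match_detections` over the same annotated pages and comparing the resulting precision and recall per object type.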

Figures and Tables

Fig. 1. Overview of the Proposed System

Fig. 2. Example Annotation (Reproduced from Lawson et al. Nature 2015;526(7571):131-5, with permission of Springer Nature[23])

Fig. 3. Pseudo Code to Generate a TFRecord File

Fig. 4. Average Precision for a Validation Set

Fig. 5. Correctly Extracted Figure, Table, and Caption ((a) Reproduced from Barre et al. Ecol Evol 2018;8(3):1496-1501[27], (b) Kuznetsova et al. J Stat Softw 2017;82(13):1-26[28])

Fig. 6. Incorrect Object Extraction by the Proposed System ((a) Reproduced from Lawson et al. Nature 2015;526(7571):131-5, with permission of Springer Nature[23], (b) Wahlström et al. Ind Eng Chem Res 2017;57(1):42-53[29])

Fig. 7. Incorrect Object Extraction by PDFFigures ((a) Reproduced from Lawson et al. Nature 2015;526(7571):131-5, with permission of Springer Nature[23])

Table 1. The Number and Type of Objects in a Training Set

Table 2. Deep Learning Environment for Object Detection

Table 3. Deep Learning Parameters

Table 4. The Number and Type of Target Objects in the Evaluation Set

Table 5. Issued Year of Articles and Graphic Format of PDF Files

Table 6. Performance Comparison
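
Fig. 3 refers to pseudocode for generating a TFRecord file, but the figure itself is not reproduced here. The sketch below is not that pseudocode; it only illustrates, under stated assumptions, how one annotated page image might be turned into the normalized bounding-box fields conventionally used by TensorFlow Object Detection API training examples. The label map, the function name, and the sample values are hypothetical, and wrapping the fields in a `tf.train.Example` and writing them to disk is left out so that the block runs without TensorFlow.

```python
from typing import Dict, List, Tuple

# Hypothetical label map for the object types named in Fig. 5.
# Ids start at 1, following the API's label-map convention.
LABEL_MAP = {"figure": 1, "table": 2, "caption": 3}

# Assumed annotation format: (label, xmin, ymin, xmax, ymax) in pixels.
Annotation = Tuple[str, float, float, float, float]


def to_example_fields(
    filename: str, width: int, height: int, annotations: List[Annotation]
) -> Dict[str, object]:
    """Collect per-image fields, with boxes normalized to the [0, 1] range."""
    if width <= 0 or height <= 0:
        raise ValueError("image width and height must be positive")

    fields: Dict[str, object] = {
        "image/filename": filename,
        "image/width": width,
        "image/height": height,
        "image/object/bbox/xmin": [],
        "image/object/bbox/xmax": [],
        "image/object/bbox/ymin": [],
        "image/object/bbox/ymax": [],
        "image/object/class/text": [],
        "image/object/class/label": [],
    }
    for label, xmin, ymin, xmax, ymax in annotations:
        if label not in LABEL_MAP:
            raise ValueError(f"unknown label: {label}")
        # Divide pixel coordinates by the image size to normalize them.
        fields["image/object/bbox/xmin"].append(xmin / width)
        fields["image/object/bbox/xmax"].append(xmax / width)
        fields["image/object/bbox/ymin"].append(ymin / height)
        fields["image/object/bbox/ymax"].append(ymax / height)
        fields["image/object/class/text"].append(label)
        fields["image/object/class/label"].append(LABEL_MAP[label])
    return fields


# Hypothetical example: one figure annotated on a 1000 x 2000 pixel page image.
fields = to_example_fields(
    "page_001.png", 1000, 2000, [("figure", 100, 200, 500, 1000)]
)
```

In a full pipeline, each field would then be encoded as a bytes, int64, or float feature, together with the encoded image bytes, before being serialized into the TFRecord file.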

References

  1. C. Clark and S. Divvala, "PDFFigures 2.0: Mining figures from research papers," in Proceedings of IEEE/ACM Joint Conference on Digital Libraries (JCDL), pp.143-152, 2016.
  2. J. Wu et al., "PDFMEF: A multi-entity knowledge extraction framework for scholarly documents and semantic search," in Proceedings of the 8th International Conference on Knowledge Capture, Article No.13, 2015.
  3. S. Ray Choudhury, P. Mitra, and C. L. Giles, "Automatic extraction of figures from scholarly documents," in Proceedings of the 2015 ACM Symposium on Document Engineering, pp.47-50, 2015.
  4. S. J. Chalk, "ChemExtractor: Enhanced Rule-Based Capture and Identification of PDF Based Property Data," 253rd American Chemical Society (ACS) National Meeting, 2017.
  5. S. Klampfl and R. Kern, "Machine learning techniques for automatically extracting contextual information from scientific publications," Semantic Web Evaluation Challenge, Springer, pp.105-116, 2015.
  6. P. Lopez, "GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications," in Proceedings of International Conference on Theory and Practice of Digital Libraries, pp.473-474, 2009.
  7. M. Aristaran, Extract Tables from PDFs [Internet], http://tabula.technology.
  8. Y. Shinyama, PDFMiner: Python PDF Parser and Analyser [Internet], http://www.unixuser.org/~euske/python/pdfminer/.
  9. Apache PDFBox: A Java PDF Library [Internet], https://pdfbox.apache.org/.
  10. Pdftohtml [Internet], http://pdftohtml.sourceforge.net.
  11. Poppler: a PDF rendering library based on the xpdf-3.0 code base [Internet], https://poppler.freedesktop.org/.
  12. A. E. Jinha, "Article 50 million: an estimate of the number of scholarly articles in existence," Learned Publishing, Vol.23, No.3, pp.258-263, 2010. https://doi.org/10.1087/20100308
  13. 254th American Chemical Society National Meeting and Expo [Internet], http://washingtondc2017.acs.org/t/197077-acs-national-meeting-washington-dc-2017.
  14. E. E. Bolton, Y. Wang, P. A. Thiessen, and S. H. Bryant, "PubChem: integrated platform of small molecules and biological activities," in Annual reports in computational chemistry, Elsevier, Vol.4, pp.217-241, 2008.
  15. R. Zakharov, V. Tkachenko, A. Korotcov, I. Presniakov, and S. Kalmykov, "Open Science Data Repository: The platform for materials research," 253rd American Chemical Society (ACS) National Meeting, 2017.
  16. Open Chemistry [Internet], https://www.openchemistry.org/.
  17. S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.39, No.6, pp.1137-1149, 2017. https://doi.org/10.1109/TPAMI.2016.2577031
  18. M. Lipinski, K. Yao, C. Breitinger, J. Beel, and B. Gipp, "Evaluation of header metadata extraction approaches and tools for scientific PDF documents," in Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, pp.385-386, 2013.
  19. P. Lopez and L. Romary, "HUMB: Automatic key term extraction from scientific articles in GROBID," in Proceedings of the 5th International Workshop on Semantic Evaluation, pp.248-251, 2010.
  20. I. G. Councill, C. L. Giles, and M.-Y. Kan, "ParsCit: an Open-source CRF Reference String Parsing Package," in Proceedings of the Language Resources and Evaluation Conference (LREC 08), Vol.8, pp.661-667, 2008.
  21. TensorFlow Object Detection API [Internet], https://research.googleblog.com/2017/06/.
  22. K. He, et al., "Deep residual learning for image recognition," in Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.770-778, 2016.
  23. D. A. Lawson, et al., "Single-cell analysis reveals a stem-cell program in human metastatic breast cancer cells," Nature, Vol.526, No.7571, pp.131-135, 2015. https://doi.org/10.1038/nature15260
  24. J. Huang, et al., "Speed/accuracy trade-offs for modern convolutional object detectors," in Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.3296-3305, 2017.
  25. J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, "Selective search for object recognition," International Journal of Computer Vision, Vol.104, No.2, pp.154-171, 2013. https://doi.org/10.1007/s11263-013-0620-5
  26. I. Sutskever, J. Martens, G. Dahl, and G. Hinton, "On the importance of initialization and momentum in deep learning," in Proceedings of International Conference on Machine Learning, pp.1139-1147, 2013.
  27. K. Barre, et al., "Tillage and herbicide reduction mitigate the gap between conventional and organic farming effects on foraging activity of insectivorous bats," Ecology and Evolution, Vol.8, No.3, pp.1496-1506, 2018. https://doi.org/10.1002/ece3.3688
  28. A. Kuznetsova, P. B. Brockhoff, and R. H. Christensen, "lmerTest package: Tests in linear mixed effects models," Journal of Statistical Software, Vol.82, No.13, pp.1-26, 2017.
  29. N. Wahlström, et al., "A Strategy for the Sequential Recovery of Biomacromolecules from Red Macroalgae Porphyra umbilicalis Kützing," Industrial & Engineering Chemistry Research, Vol.57, No.1, pp.42-52, 2017. https://doi.org/10.1021/acs.iecr.7b03768