Fig. 1. Overview of the Proposed System
Fig. 2. Example annotation (Reproduced from Lawson et al. Nature 2015;526(7571):131-5, with Permission of Springer Nature[23])
Fig. 3. Pseudo Code to Generate a TFRecord File
Fig. 4. Average Precision for a Validation Set
Fig. 5. Correctly Extracted Figure, Table, and Caption ((a) Reproduced from Barre et al. Ecol Evol 2018;8(3):1496-1506[27], (b) Kuznetsova et al. J Stat Softw 2017;82(13):1-26[28])
Fig. 6. Incorrect Object Extraction of the Proposed System ((a) Reproduced from Lawson et al. Nature 2015;526(7571):131-5, with Permission of Springer Nature[23], (b) Wahlström et al. Ind Eng Chem Res 2017;57(1):42-53[29])
Fig. 7. Incorrect Object Extraction of PDFFigures ((a) Reproduced from Lawson et al. Nature 2015;526(7571):131-5, with Permission of Springer Nature[23])
Table 1. The Number and Type of Objects in a Training Set
Table 2. Deep Learning Environment for Object Detection
Table 3. Deep Learning Parameters
Table 4. The Number and Type of Target Objects in the Evaluation Set
Table 5. Issued Year of Articles and Graphic Format of PDF Files
Table 6. Performance Comparison
References
- C. Clark and S. Divvala, "PDFFigures 2.0: Mining figures from research papers," in Proceedings of IEEE/ACM Joint Conference on Digital Libraries (JCDL), pp.143-152, 2016.
- J. Wu, et al., "PDFMEF: A multi-entity knowledge extraction framework for scholarly documents and semantic search," in Proceedings of the 8th International Conference on Knowledge Capture, Article No.13, 2015.
- S. Ray Choudhury, P. Mitra, and C. L. Giles, "Automatic extraction of figures from scholarly documents," in Proceedings of the 2015 ACM Symposium on Document Engineering, pp.47-50, 2015.
- S. J. Chalk, "ChemExtractor: Enhanced Rule-Based Capture and Identification of PDF Based Property Data," 253rd American Chemical Society (ACS) National Meeting, 2017.
- S. Klampfl and R. Kern, "Machine learning techniques for automatically extracting contextual information from scientific publications," Semantic Web Evaluation Challenge, Springer, pp.105-116, 2015.
- P. Lopez, "GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications," in Proceedings of International Conference on Theory and Practice of Digital Libraries, pp.473-474, 2009.
- M. Aristaran, Extract Tables from PDFs [Internet], http://tabula.technology.
- Y. Shinyama, PDFMiner: Python PDF Parser and Analyser [Internet], http://www.unixuser.org/~euske/python/pdfminer/.
- Apache PDFBox: A Java PDF Library [Internet], https://pdfbox.apache.org/.
- Pdftohtml [Internet], http://pdftohtml.sourceforge.net.
- Poppler: a PDF rendering library based on the xpdf-3.0 code base [Internet], https://poppler.freedesktop.org/.
- A. E. Jinha, "Article 50 million: an estimate of the number of scholarly articles in existence," Learned Publishing, Vol.23, No.3, pp.258-263, 2010. https://doi.org/10.1087/20100308
- 254th American Chemical Society National Meeting and Expo [Internet], http://washingtondc2017.acs.org/t/197077-acs-national-meeting-washington-dc-2017.
- E. E. Bolton, Y. Wang, P. A. Thiessen, and S. H. Bryant, "PubChem: integrated platform of small molecules and biological activities," in Annual Reports in Computational Chemistry, Elsevier, Vol.4, pp.217-241, 2008.
- R. Zakharov, V. Tkachenko, A. Korotcov, I. Presniakov, and S. Kalmykov, "Open Science Data Repository: The platform for materials research," 253rd American Chemical Society (ACS) National Meeting, 2017.
- Open Chemistry [Internet], https://www.openchemistry.org/.
- S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.39, No.6, pp.1137-1149, 2017. https://doi.org/10.1109/TPAMI.2016.2577031
- M. Lipinski, K. Yao, C. Breitinger, J. Beel, and B. Gipp, "Evaluation of header metadata extraction approaches and tools for scientific PDF documents," in Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, pp.385-386, 2013.
- P. Lopez and L. Romary, "HUMB: Automatic key term extraction from scientific articles in GROBID," in Proceedings of the 5th International Workshop on Semantic Evaluation, pp.248-251, 2010.
- I. G. Councill, C. L. Giles, and M.-Y. Kan, "ParsCit: an Open-source CRF Reference String Parsing Package," in Proceedings of the Language Resources and Evaluation Conference (LREC 08), Vol.8, pp.661-667, 2008.
- TensorFlow Object Detection API [Internet], https://research.googleblog.com/2017/06/.
- K. He, et al., "Deep residual learning for image recognition," in Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.770-778, 2016.
- D. A. Lawson, et al., "Single-cell analysis reveals a stem-cell program in human metastatic breast cancer cells," Nature, Vol.526, No.7571, pp.131-135, 2015. https://doi.org/10.1038/nature15260
- J. Huang, et al., "Speed/accuracy trade-offs for modern convolutional object detectors," in Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.3296-3305, 2017.
- J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, "Selective search for object recognition," International Journal of Computer Vision, Vol.104, No.2, pp.154-171, 2013. https://doi.org/10.1007/s11263-013-0620-5
- I. Sutskever, J. Martens, G. Dahl, and G. Hinton, "On the importance of initialization and momentum in deep learning," in Proceedings of International Conference on Machine Learning, pp.1139-1147, 2013.
- K. Barre, et al., "Tillage and herbicide reduction mitigate the gap between conventional and organic farming effects on foraging activity of insectivorous bats," Ecology and Evolution, Vol.8, No.3, pp.1496-1506, 2018. https://doi.org/10.1002/ece3.3688
- A. Kuznetsova, P. B. Brockhoff, and R. H. Christensen, "lmerTest package: Tests in linear mixed effects models," Journal of Statistical Software, Vol.82, No.13, pp.1-26, 2017.
- N. Wahlström, et al., "A Strategy for the Sequential Recovery of Biomacromolecules from Red Macroalgae Porphyra umbilicalis Kützing," Industrial & Engineering Chemistry Research, Vol.57, No.1, pp.42-52, 2017. https://doi.org/10.1021/acs.iecr.7b03768