Automatic Generation of Bibliographic Metadata with Reference Information for Academic Journals

A Study on the Automatic Generation of Bibliographic Metadata Including Reference Information from Academic Papers

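  • 정선기 (Department of Library and Information Science, Kyonggi University)
  • 신현호 (Department of Library and Information Science, Kyonggi University)
  • 지선영 (Boin Information Technology Co., Ltd.)
  • 최성필 (Department of Library and Information Science, Kyonggi University)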
  • Received : 2022.07.17
  • Accepted : 2022.08.15
  • Published : 2022.08.31

Abstract

Bibliographic metadata helps researchers find the publications they need and follow academic trends in their fields. Creating this metadata manually is costly and time-consuming, yet automating it with rule-based methods alone is difficult because article layouts and styles vary widely across publishers and academic societies. To overcome these difficulties, this study proposes a two-step extraction process that combines rules and deep neural networks to generate bibliographic metadata for scientific articles. A deep neural network-based model first identifies the target areas for extraction in an article, and the content of those areas is then analyzed and subdivided into the relevant metadata elements. The proposed model also includes a model for generating reference summary information: it separates the end of the body text from the starting point of the references and extracts individual references using a set of rules, then identifies the bibliographic items in each reference with a deep neural network. In addition, to assess the feasibility of a model that generates the bibliographic information of academic papers without pre- and post-processing, we conducted comparative experiments under various settings and configurations. In these experiments, the proposed method outperformed the configuration without pre- and post-processing.

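Bibliographic information can be consulted to recognize the latest trends in a research topic and to verify its usefulness. That is, for researchers to reach the literature they need quickly, author information, summaries, abstracts, references, and similar details in academic papers must be easy to identify. However, electronic academic papers currently published in PDF format follow a layout unique to each publisher, so rule-based extraction relying on a few features struggles to extract the target information from a large body of literature and generate summarized bibliographic records automatically. This study therefore proposes a way to overcome the difficulty that layout diversity poses for automatic metadata extraction when generating bibliographic records for academic papers. The proposed model consists of a deep neural network-based model that distinguishes the target areas from the start of the body text on the first page of a paper, where bibliographic details are mainly stated, and a rule-based model that classifies the extracted details into fine-grained metadata and regenerates them. The proposed model also includes a model for generating reference summary information: the separation of the end of the body text from the start of the references and the extraction of individual references are handled by rule-based methods, while a deep neural network classifies the bibliographic information within each extracted reference. In addition, to check the feasibility of a model that extracts and generates a paper's own bibliographic information without pre- and post-processing, we built a model that also covers the reference area and ran comparative experiments. The results show that the proposed approach achieved higher performance than the comparison experiment that processed the bibliographic information without pre- and post-processing.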
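
The rule-based part of the reference pipeline described above (separating the body text from the reference section, then extracting individual references) might look roughly like the following minimal Python sketch. It is an illustration only, not the authors' implementation: the heading pattern, the entry-start pattern, the function names, and the sample lines are all assumptions made for this example, and the deep neural network step that labels the items inside each reference is not shown.

```python
import re

# Assumed heading rule: a line that contains only a reference-section heading.
HEADING = re.compile(r"^\s*(references|bibliography|참고문헌)\s*$", re.IGNORECASE)
# Assumed entry rule: a new reference starts with "1." or "[1]".
ENTRY_START = re.compile(r"^\s*(?:\[\d+\]|\d+\.)\s+")


def split_body_and_references(lines):
    """Split lines into (body, references) at the last heading match."""
    heading_index = None
    for i, line in enumerate(lines):
        if HEADING.match(line):
            # Keep the last match; an earlier one may belong to the body.
            heading_index = i
    if heading_index is None:
        return list(lines), []
    return list(lines[:heading_index]), list(lines[heading_index + 1:])


def split_references(reference_lines):
    """Group reference lines into entries; wrapped lines join the previous entry."""
    references = []
    for line in reference_lines:
        text = line.strip()
        if not text:
            continue
        if ENTRY_START.match(line):
            # Drop the leading number marker and start a new entry.
            references.append(ENTRY_START.sub("", line, count=1).strip())
        elif references:
            references[-1] += " " + text
    return references


# Toy input (hypothetical, not taken from any real paper).
sample = [
    "Toy body text that ends just before the reference list.",
    "References",
    "1. Author, A. (2020). First toy title that wraps",
    "   onto a second line. Toy Journal, 1(1), 1-10.",
    "2. Author, B. (2021). Second toy title. Toy Journal, 2(1), 11-20.",
]

body, reference_lines = split_body_and_references(sample)
entries = split_references(reference_lines)
for entry in entries:
    print(entry)
```

Fixed rules like these are fragile: a wrapped line that happens to begin with a number and a period would be mistaken for a new entry. That fragility across publisher styles is the kind of difficulty the abstract cites as motivation for combining rules with a deep neural network.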
