Automatic Extraction of References for Research Reports using Deep Learning Language Model

  • Received : 2023.05.15
  • Accepted : 2023.06.10
  • Published : 2023.06.30

Abstract

This study evaluates how effectively deep learning language models can automatically extract references from research reports, with the aim of building a reference database efficiently. Research reports cite a mix of publication types, such as monographs, journal articles, and other reports, and, unlike academic journals, their reference formatting varies from institution to institution, which makes automatic extraction difficult. To address this, we added a task that separates references from non-reference phrases to the metadata extraction task commonly used in reference extraction. We built three datasets: references taken from the research reports of a particular institution, references in academic journal style, and a combination of academic journal references and non-reference text. Two deep learning language models, RoBERTa+CRF and ChatGPT, were compared on three tasks: metadata extraction, classification of material type, and separation of reference text from non-reference text. The models reached maximum F1-scores of 95.41% for metadata extraction and 98.91% for material type classification and reference text separation. Based on these results, we suggest how deep learning language models and each dataset type can be used to build reference databases for research reports that contain both reference and non-reference text.

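
The F1-scores reported in the abstract can be illustrated with a small sketch. The code below computes a span-level (exact-match) F1 over BIO-tagged reference tokens, which is one common way to score metadata extraction. The abstract does not state which F1 variant the study used, so the exact-match criterion, the field labels (`AUTHOR`, `YEAR`, `TITLE`), and the function names are assumptions made only for illustration.

```python
from typing import List, Optional, Set, Tuple

# (field label, start index, end index exclusive); labels are hypothetical
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect (label, start, end) spans from one BIO tag sequence."""
    spans: Set[Span] = set()
    label: Optional[str] = None
    start = 0
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" closes the last span
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            spans.add((label, start, i))
            label = None
        # A "B-" tag opens a span; a stray "I-" tag is treated leniently as an opener.
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            label, start = tag[2:], i
    return spans


def span_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1: a predicted span counts only if label and boundaries match."""
    tp = fp = fn = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = bio_to_spans(g_tags), bio_to_spans(p_tags)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # One toy reference with six tokens; the prediction cuts the title one token short.
    gold = [["B-AUTHOR", "I-AUTHOR", "B-YEAR", "B-TITLE", "I-TITLE", "I-TITLE"]]
    pred = [["B-AUTHOR", "I-AUTHOR", "B-YEAR", "B-TITLE", "I-TITLE", "O"]]
    print(f"{span_f1(gold, pred):.4f}")  # prints 0.6667
```

In the toy example two of three predicted fields match the gold spans exactly, so precision and recall are both 2/3. A token-level or per-field macro average would give different numbers on the same data, which is why the scoring variant matters when comparing models.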

Acknowledgement

This study was supported by the 2023 Information Resources Operation Project of the Korea Information Society Development Institute (KISDI).
