A method for metadata extraction from a collection of records using Named Entity Recognition in Natural Language Processing

  • Received : 2024.04.16
  • Accepted : 2024.05.10
  • Published : 2024.05.31

Abstract

This pilot study explores a method for extracting metadata values and descriptive information from records using named entity recognition (NER), a technique in natural language processing (NLP), which is a subfield of artificial intelligence. The study examines handwritten records of the Guro Industrial Complex produced in the 1960s and 1970s, comprising approximately 1,200 pages and 80,000 words. After preprocessing, including digitization, named entities in the record text were recognized with a publicly available language API built on Google's Bidirectional Encoder Representations from Transformers (BERT) language model. As a result, 173 personal names and 314 names of organizations and institutions were extracted from the records. These entities are expected to serve as direct search terms for the contents of the records. The study also identifies problems that arise when the theoretical methodology of NLP is applied to real-world records consisting of semi-structured and unstructured text, and presents possible solutions and implications to consider.
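As a rough illustration of how recognized entities could be turned into search terms, the sketch below groups entity strings by category and records the pages on which each one appears. The input layout (a `text` and a `type` field per entity), the tag prefixes `PS` and `OG`, the function name `collect_entities`, and the sample values are all assumptions made for illustration; they are not taken from the study or from any particular API's documentation.

```python
from collections import defaultdict

# Hypothetical tag prefixes: "PS" marks persons, "OG" marks organizations.
# A real NER service may use a different tag set.
CATEGORY_PREFIXES = {"person": "PS", "organization": "OG"}


def collect_entities(pages):
    """Group unique entity strings by category, keeping the pages they occur on.

    `pages` maps a page number to a list of {"text": ..., "type": ...} dicts,
    an assumed shape for NER output after digitization.
    """
    index = {category: defaultdict(set) for category in CATEGORY_PREFIXES}
    for page_no, entities in pages.items():
        for entity in entities:
            # Collapse runs of whitespace so OCR spacing variants merge.
            name = " ".join(entity["text"].split())
            if not name:
                continue
            for category, prefix in CATEGORY_PREFIXES.items():
                if entity["type"].startswith(prefix):
                    index[category][name].add(page_no)
    return index


# Invented sample input, used only to exercise the function.
sample = {
    1: [
        {"text": "Hong Gildong", "type": "PS_NAME"},
        {"text": "Example Textile Co.", "type": "OGG_ECONOMY"},
    ],
    2: [
        {"text": "Hong  Gildong", "type": "PS_NAME"},
        {"text": "1968", "type": "DT_YEAR"},
    ],
}

index = collect_entities(sample)
for category, names in index.items():
    print(category, len(names), sorted(names))
```

Keeping page numbers alongside each name is one possible design choice: it lets an extracted entity act as an access point back to the pages where it occurs, rather than only as a count.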
