Building Specialized Language Model for National R&D through Knowledge Transfer Based on Further Pre-training

Yu, Eunji;Seo, Sumin;Kim, Namgyu;

doi:10.15813/kmr.2021.22.3.006

Knowledge Management Research (지식경영연구)

Volume 22 Issue 3 / Pages 91-106 / 2021 / 1229-9553 (pISSN)

The Knowledge Management Society of Korea (한국지식경영학회)

추가 사전학습 기반 지식 전이를 통한 국가 R&D 전문 언어모델 구축

Yu, Eunji (Graduate School of Business IT, Kookmin University) ;
Seo, Sumin (Graduate School of Business IT, Kookmin University) ;
Kim, Namgyu (Graduate School of Business IT, Kookmin University)

유은지 (국민대학교 비즈니스IT 전문대학원) ;
서수민 (국민대학교 비즈니스IT 전문대학원) ;
김남규 (국민대학교 비즈니스IT 전문대학원)

Received : 2021.08.22
Accepted : 2021.09.05
Published : 2021.09.30

https://doi.org/10.15813/kmr.2021.22.3.006

Abstract

With the recent rapid development of deep learning technology, demand is growing rapidly for analyzing the huge volume of text documents in the national R&D field from various perspectives. In particular, interest is growing in applying BERT (Bidirectional Encoder Representations from Transformers), a language model pre-trained on a large corpus. However, terminology that is used frequently in highly specialized fields such as national R&D is often not sufficiently learned by the basic BERT model, and this has been pointed out as a limitation when BERT is used to understand documents in specialized fields. Therefore, this study proposes a method for building R&D KoBERT, a language model that transfers national R&D knowledge to basic BERT through further pre-training. To evaluate the proposed model, we performed classification analysis on about 116,000 R&D reports in the health care and information and communication fields. The experimental results show that the proposed model achieves higher accuracy than the pure KoBERT model.

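The abstract does not describe how further pre-training is carried out. As a minimal sketch, assuming it continues BERT's masked language modeling objective on domain text, the snippet below shows the standard BERT masking rule: about 15% of tokens are selected for prediction, and of those 80% become `[MASK]`, 10% become a random vocabulary token, and 10% stay unchanged. The function name `mask_tokens`, the toy vocabulary, and the example sentence are hypothetical and are not taken from the paper.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Apply BERT-style masking to a token list (illustrative sketch only).

    Returns (masked_tokens, labels). labels[i] holds the original token at
    positions the model must predict, and None everywhere else.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # this position contributes to the MLM loss
            r = rng.random()
            if r < 0.8:
                masked.append(MASK)               # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(rng.choice(vocab))  # 10%: random vocabulary token
            else:
                masked.append(tok)                # 10%: keep the original token
        else:
            masked.append(tok)
            labels.append(None)  # not selected: no prediction required
    return masked, labels

# Hypothetical domain sentence and vocabulary, for illustration only.
vocab = ["national", "r&d", "report", "genome", "sequencing", "network"]
tokens = ["national", "r&d", "report", "on", "genome", "sequencing"]
masked, labels = mask_tokens(tokens, vocab)
```

During further pre-training, a model would be trained to recover the tokens stored in `labels` from `masked`, so that terminology frequent in the domain corpus is learned before fine-tuning on a downstream task such as document classification.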
