DOI: http://dx.doi.org/10.15813/kmr.2021.22.3.006

Building Specialized Language Model for National R&D through Knowledge Transfer Based on Further Pre-training  

Yu, Eunji (Graduate School of Business IT, Kookmin University)
Seo, Sumin (Graduate School of Business IT, Kookmin University)
Kim, Namgyu (Graduate School of Business IT, Kookmin University)
Publication Information
Knowledge Management Research, v.22, no.3, 2021, pp. 91-106
Abstract
With the recent rapid development of deep learning technology, demand for analyzing the huge volume of text documents in the national R&D field from various perspectives is growing rapidly. In particular, there is increasing interest in applying BERT (Bidirectional Encoder Representations from Transformers), a language model pre-trained on a large corpus. However, the terminology used frequently in highly specialized fields such as national R&D is often insufficiently covered by basic BERT pre-training, which limits BERT's ability to understand documents in these fields. This study therefore proposes a method for building an R&D KoBERT language model that transfers national R&D domain knowledge to basic BERT through further pre-training. To evaluate the performance of the proposed model, we performed classification analysis on about 116,000 R&D reports in the health care and information and communication fields. Experimental results show that the proposed model achieves higher accuracy than the pure KoBERT model.
Keywords
National R&D; Knowledge Transfer; Pre-trained Language Model; BERT; Further Pre-training
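
For readers who want to see the shape of this pipeline in code, the sketch below illustrates the two stages the abstract describes: further pre-training with masked language modeling on a domain corpus, followed by fine-tuning the transferred checkpoint as a report classifier. It is a minimal illustration using the Hugging Face transformers and datasets libraries, not the authors' implementation: the KoBERT checkpoint id ("skt/kobert-base-v1"), the placeholder corpora and labels, and all hyperparameters are assumptions.

```python
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

# Checkpoint id is an assumption; the SKTBrain KoBERT repository documents
# the exact loading path (a custom SentencePiece tokenizer may be required).
BASE = "skt/kobert-base-v1"
tokenizer = AutoTokenizer.from_pretrained(BASE)

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=512)

# ---- Stage 1: further pre-training (masked language modeling) ----
# The corpus below is a placeholder for the national R&D report texts.
rnd_corpus = Dataset.from_dict(
    {"text": ["National R&D report sentence 1.", "Report sentence 2."]}
).map(tokenize, remove_columns=["text"])

mlm_model = AutoModelForMaskedLM.from_pretrained(BASE)
Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir="rnd-kobert", num_train_epochs=3),
    train_dataset=rnd_corpus,
    # 15% random masking, i.e., BERT's original pre-training objective,
    # applied to the new domain corpus only.
    data_collator=DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=0.15
    ),
).train()
mlm_model.save_pretrained("rnd-kobert")  # the "R&D KoBERT" checkpoint

# ---- Stage 2: fine-tune the transferred model as a report classifier ----
# Two labels stand in for the paper's report category scheme.
labeled = Dataset.from_dict(
    {"text": ["A health-care R&D abstract.", "An ICT R&D abstract."],
     "label": [0, 1]}
).map(tokenize, remove_columns=["text"])

clf = AutoModelForSequenceClassification.from_pretrained(
    "rnd-kobert", num_labels=2
)
Trainer(
    model=clf,
    args=TrainingArguments(output_dir="rnd-kobert-clf", num_train_epochs=3),
    train_dataset=labeled,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
).train()
```

The key design point is that stage 1 keeps BERT's original pre-training objective and changes only the corpus, so domain terminology is absorbed without discarding the general-language knowledge already in KoBERT; stage 2 then reuses that transferred checkpoint exactly as one would fine-tune basic KoBERT.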