Automatic Classification of Academic Articles Using BERT Model Based on Deep Learning

Kim, In hu; Kim, Seong hee

doi:10.3743/KOSIM.2022.39.3.293

Journal of the Korean Society for Information Management (정보관리학회지)

Volume 39 Issue 3 / Pages 293-310 / 2022 / 1013-0799 (pISSN) / 2586-2073 (eISSN)

Korean Society for Information Management (한국정보관리학회)

딥러닝 기반의 BERT 모델을 활용한 학술 문헌 자동분류

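Kim, In hu (Graduate School, Department of Library and Information Science, Chung-Ang University);
Kim, Seong hee (Department of Library and Information Science, Chung-Ang University)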

Received : 2022.08.21
Accepted : 2022.09.13
Published : 2022.09.30

https://doi.org/10.3743/KOSIM.2022.39.3.293

Abstract

This study analyzes the performance of a BERT-based document classification model by automatically classifying documents in the field of library and information science with KoBERT, a BERT model pre-trained on Korean data. Abstracts of 5,357 papers from 7 journals in the field were used to evaluate whether automatic classification performance differs according to the size of the training data. Precision, recall, and the F-measure served as the evaluation metrics. Subject areas with large amounts of high-quality data performed well, with F-measures of 90% or higher. In contrast, subject areas with low-quality data, high similarity to other subject areas, and few thematically distinctive features did not reach a meaningfully high level of performance. The study is expected to serve as basic data for suggesting the feasibility of using pre-trained models to automatically classify academic documents.

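In this study, documents in the field of library and information science were automatically classified with a BERT model trained on Korean data, and the classification performance was analyzed. To this end, abstract data from 5,357 papers in 7 journals in the field were used to analyze and evaluate how automatic classification performance differs according to the size of the training data. Precision, recall, and the F-measure were used as evaluation metrics. In the evaluation, subject areas with large amounts of high-quality data showed a high level of performance, with an F-measure of 90% or higher. In contrast, when data quality was low, similarity in content to other subject areas was high, and thematically distinctive features were few, no meaningfully high level of performance was obtained. This study is expected to serve as basic data for suggesting the applicability of pre-trained models that can be used continuously for future academic literature.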
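
The evaluation metrics named in the abstract (precision, recall, and the F-measure) can be illustrated with a minimal sketch. It assumes single-label classification and the balanced F1 form of the F-measure; the abstract does not say which variant the authors used. The function name `per_class_scores` and the subject labels below are hypothetical and are not taken from the paper or its data.

```python
from collections import Counter


def per_class_scores(true_labels, predicted_labels):
    """Return {label: (precision, recall, f1)} for single-label classification."""
    tp = Counter()  # documents correctly assigned to the class
    fp = Counter()  # documents wrongly assigned to the class
    fn = Counter()  # documents of the class assigned elsewhere
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1

    scores = {}
    for label in sorted(set(true_labels) | set(predicted_labels)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        # F1 is the harmonic mean of precision and recall (assumed variant).
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Hypothetical labels for illustration only; not the paper's categories.
true_labels = ["bibliography", "bibliography", "informatics", "informatics", "archives"]
predicted_labels = ["bibliography", "informatics", "informatics", "informatics", "bibliography"]

scores = per_class_scores(true_labels, predicted_labels)
for label, (p, r, f) in scores.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

Reporting the scores per class rather than as a single average makes it visible when a small or poorly separated subject area scores far below the larger ones, which is the kind of difference the abstract describes.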

