Performance Assessment of Machine Learning and Deep Learning in Regional Name Identification and Classification in Scientific Documents

  • Received : 2024.02.27
  • Accepted : 2024.04.12
  • Published : 2024.04.30

Abstract

Generative AI is now used across nearly every field, and in in-depth data analysis it is approaching expert-level performance. Identifying regional names in scientific literature nevertheless remains difficult: training data are scarce and, as a result, AI models have seldom been applied to the task. This study builds a standardized dataset for classifying regional names from the address data of authors affiliated with Korean institutions in the Web of Science, then tests and evaluates how well machine learning and deep learning models handle this real-world problem. The BERT model performed best, with a precision of 98.41%, recall of 98.2%, and F1 score of 98.31% for metropolitan-level (metropolitan city and province) classification, and a precision of 91.79%, recall of 88.32%, and F1 score of 89.54% for city-level (si/gun/gu) classification. The results provide a data foundation for future research on regional R&D activity, researcher mobility between regions, and regional research collaboration.
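
The abstract does not say how precision, recall, and F1 were averaged across region classes. The sketch below assumes macro averaging (an unweighted mean of per-class scores), which is one common choice for multiclass problems. The function name `macro_scores` and the toy region labels are hypothetical and are not taken from the paper's dataset. Under this assumption, the macro F1 is the mean of per-class F1 scores and need not equal the harmonic mean of the macro precision and macro recall. That would be consistent with the reported city-level F1 (89.54%) sitting slightly below the harmonic mean of the reported precision and recall (about 90.0%), although the paper's actual averaging scheme may differ.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, F1) over all observed classes."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()

    # Count true positives, false positives, and false negatives per class.
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1

    precisions, recalls, f1s = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical toy labels, for illustration only.
y_true = ["Seoul", "Seoul", "Seoul", "Seoul", "Daejeon", "Daejeon", "Busan", "Busan"]
y_pred = ["Seoul", "Seoul", "Seoul", "Daejeon", "Daejeon", "Daejeon", "Busan", "Seoul"]

precision, recall, f1 = macro_scores(y_true, y_pred)
harmonic = 2 * precision * recall / (precision + recall)

print(f"macro precision: {precision:.4f}")
print(f"macro recall:    {recall:.4f}")
print(f"macro F1:        {f1:.4f}")
print(f"harmonic mean of macro precision and recall: {harmonic:.4f}")
```

On the toy data the last two printed values differ, which shows why a reported F1 cannot always be recomputed from the reported precision and recall alone.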

Acknowledgement

This research was conducted in 2024 as part of the basic program of the Korea Institute of Science and Technology Information (KISTI) (Project No. K-24-L03-C01-S01).
