An Experimental Study on the Automatic Classification of Korean Journal Articles through Feature Selection

  • Received : 2022.02.14
  • Accepted : 2022.03.04
  • Published : 2022.03.30

Abstract

To provide basic data for systematically supporting and evaluating R&D activities, as well as for setting current and future research directions by identifying specific trends in domestic academic research, this study sought an efficient way to assign standardized subject categories (controlled keywords) to individual journal articles. To this end, various experiments were conducted on the major factors affecting the performance of automatic classification, with a focus on feature selection techniques, in order to automatically assign categories from the National Research Foundation of Korea's Academic Research Classification Scheme to domestic journal articles. The results show that, for the automatic classification of domestic journal articles, which form an imbalanced dataset reflecting a real-world environment, a fairly good level of performance can be expected using simpler classifiers, feature selection techniques, and relatively small training sets.

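The abstract centers on feature selection as a driver of classification performance. As an illustrative sketch only (the paper's exact techniques, classifiers, and data are not reproduced here), the classic chi-square criterion scores each term by how strongly its presence is associated with a class; the function names and the toy corpus below are hypothetical:

```python
from collections import defaultdict

def chi_square_scores(docs, labels):
    """Score each term by its maximum chi-square association over classes.

    docs: list of token lists; labels: parallel list of class labels.
    Returns {term: score}, using max-over-classes as the global score.
    """
    n = len(docs)
    classes = set(labels)
    df = defaultdict(int)                      # document frequency of each term
    df_c = defaultdict(lambda: defaultdict(int))  # per-class document frequency
    for toks, y in zip(docs, labels):
        for t in set(toks):
            df[t] += 1
            df_c[t][y] += 1
    n_c = defaultdict(int)                     # documents per class
    for y in labels:
        n_c[y] += 1
    scores = {}
    for t, d in df.items():
        best = 0.0
        for c in classes:
            a = df_c[t].get(c, 0)   # in class c, contains t
            b = d - a               # outside c, contains t
            cc = n_c[c] - a         # in class c, lacks t
            dd = n - d - cc         # outside c, lacks t
            denom = (a + b) * (a + cc) * (b + dd) * (cc + dd)
            if denom:
                best = max(best, n * (a * dd - b * cc) ** 2 / denom)
        scores[t] = best
    return scores

def select_features(docs, labels, k):
    """Keep the k terms with the highest chi-square scores."""
    scores = chi_square_scores(docs, labels)
    return [t for t, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]
```

On a small two-class corpus, terms that occur in only one class receive the highest scores, so a simple classifier trained on the reduced vocabulary can often match one trained on all terms, which is consistent with the abstract's finding that simpler setups perform well.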
References

  1. Chung, Eunkyung (2009). A semantic-based feature expansion approach for improving the effectiveness of text categorization by using WordNet. Journal of the Korean Society for Information Management, 26(3), 261-278. https://doi.org/10.3743/KOSIM.2009.26.3.261
  2. KCI (Korea Citation Index) (2022). Data Statistics. National Research Foundation of Korea. Available: https://www.kci.go.kr/kciportal/po/statistics/poStatisticsMain.kci?tab_code=Tab3
  3. Kim, Pan Jun & Lee, Jae Yun (2012). A study on the reclassification of author keywords for automatic assignment of descriptors. Journal of the Korean Society for Information Management, 29(2), 225-246. https://doi.org/10.3743/KOSIM.2012.29.2.225
  4. Kim, Pan Jun & Lee, Jae Yun (2018). An experimental study on the performance improvement of automatic classification for the articles of Korean journals based on controlled keywords in international database. Journal of the Korean Library and Information Science, 48(3), 491-510. https://doi.org/10.4275/KSLIS.2014.48.3.491
  5. Kim, Pan Jun (2006). A study on automatic assignment of descriptors using machine learning. Journal of the Korean Society for Information Management, 23(1), 279-299. https://doi.org/10.3743/KOSIM.2006.23.1.279
  6. Kim, Pan Jun (2016). An analytical study on performance factors of automatic classification based on machine learning. Journal of the Korean Society for Information Management, 33(2), 33-59. http://dx.doi.org/10.3743/KOSIM.2016.33.2.033
  7. Kim, Pan Jun (2018). An analytical study on automatic classification of domestic journal articles based on machine learning. Journal of the Korean Society for Information Management, 35(2), 37-62. https://doi.org/10.3743/KOSIM.2018.35.2.037
  8. Kim, Pan Jun (2019). An analytical study on automatic classification of domestic journal articles using random forest. Journal of the Korean Society for Information Management, 36(2), 37-62. https://doi.org/10.3743/KOSIM.2019.36.2.057
  9. Kim, Pan Jun (2021a). A study on the characteristics by keyword types in the intellectual structure analysis based on co-word analysis: focusing on overseas open access field. Journal of the Korean Library and Information Science, 55(3), 103-129. http://dx.doi.org/10.4275/KSLIS.2021.55.3.103
  10. Kim, Pan Jun (2021b). A study on the intellectual structure analysis by keyword type based on profiling: focusing on overseas open access field. Journal of the Korean Library and Information Science, 55(4), 115-140. http://dx.doi.org/10.4275/KSLIS.2021.55.4.115
  11. Kim, Seon-Wu, Ko, Gun-Woo, Choi, Won-Jun, Jeong, Hee-Seok, Yoon, Hwa-Mook, & Choi, Sung-Pil (2018). Semi-automatic construction of learning set and integration of automatic classification for academic literature in technical sciences. Journal of the Korean Society for Information Management, 35(4), 141-164. http://dx.doi.org/10.3743/KOSIM.2018.35.4.141
  12. Lee, Jae Yun (2005). An empirical study on improving the performance of text categorization considering the relationships between feature selection criteria and weighting methods. Journal of the Korean Society for Library and Information Science, 39(2), 123-146. http://dx.doi.org/10.4275/kslis.2005.39.2.123
  13. National Research Foundation of Korea (2016). Academic Research Classification Scheme. Available: https://www.nrf.re.kr/biz/doc/class/view?menu_no=323
  14. Yuk, Jee Hee & Song, Min (2018). A study of research on methods of automated biomedical document classification using topic modeling and deep learning. Journal of the Korean Society for Information Management, 35(2), 63-88. https://doi.org/10.3743/KOSIM.2018.35.2.063
  15. Abiodun, E. O., Alabdulatif, A., Abiodun, O. I., Alawida, M., Alabdulatif, A., & Alkhawaldeh, R. S. (2021). A systematic review of emerging feature selection optimization methods for optimal text classification: the present state and prospective opportunities. Neural Computing & Applications, 33(4), 1-28. https://doi.org/10.1007/s00521-021-06406-8
  16. Cai, J., Luo, J., Wang, S., & Yang, S. (2018). Feature selection in machine learning: a new perspective. Neurocomputing, 300, 70-79. https://doi.org/10.1016/j.neucom.2017.11.077
  17. Chandrashekar, G. & Sahin, F. (2014). A survey on feature selection methods. Computers & Electrical Engineering, 40(1), 16-28. https://doi.org/10.1016/j.compeleceng.2013.11.024
  18. Chang, F., Guo, J., Xu, W., & Yao, K. (2015). A feature selection method to handle imbalanced data in text classification. Journal of Digital Information Management, 13, 169-175. Available: https://www.dline.info/fpaper/jdim/v13i3/v13i3_6.pdf
  19. Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: a review. Multimedia Tools and Applications, 78, 3797-3816. https://doi.org/10.1007/s11042-018-6083-5
  20. Drotar, P., Gazda, J., & Smekal, Z. (2015). An experimental comparison of feature selection methods on two-class biomedical datasets. Computers in Biology and Medicine, 66, 1-10. https://doi.org/10.1016/j.compbiomed.2015.08.010
  21. Drotar, P., Gazda, M., & Vokorokos, L. (2019). Ensemble feature selection using election methods and ranker clustering. Information Sciences, 480, 365-380. https://doi.org/10.1016/j.ins.2018.12.033
  22. Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research, 3, 1289-1305. Available: https://www.jmlr.org/papers/volume3/forman03a/forman03a_full.pdf
  23. Fragoudis, D., Meretakis, D., & Likothanassis, S. (2005). Best terms: an efficient feature-selection algorithm for text categorization. Knowledge and Information Systems, 8(1), 16-33. https://doi.org/10.1007/s10115-004-0177-2
  24. Gunal, S. (2012). Hybrid feature selection for text classification. Turkish Journal of Electrical Engineering and Computer Science, 20(Sup.2), 1296-1311. Available: https://dergipark.org.tr/en/pub/tbtkelektrik/issue/12058/144170
  25. Gutkin, M., Shamir, R., & Dror, G. (2009). SlimPLS: a method for feature selection in gene expression-based disease classification. PloS One, 4(7), e6416. https://doi.org/10.1371/journal.pone.0006416
  26. Guyon, I. & Elisseeff, A. (2003). An introduction to variable and feature selection. The Journal of Machine Learning Research, 3, 1157-1182. Available: https://dl.acm.org/doi/pdf/10.5555/944919.944968
  27. Harish, B. & Revanasiddappa, M. (2017). A comprehensive survey on various feature selection methods to categorize text documents. International Journal of Computer Applications, 164(8), 1-7. http://doi.org/10.5120/ijca2017913711
  28. Iqbal, M., Abid, M. M., Khalid, M. N., & Manzoor, A. (2020). Review of feature selection methods for text classification. International Journal of Advanced Computer Research, 10(49), 138-152. http://dx.doi.org/10.19101/IJACR.2020.1048037
  29. Joachims, T. (1997). A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. Proceedings of the Fourteenth International Conference on Machine Learning (ICML '97), 143-151. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.45.6977&rep=rep1&type=pdf
  30. Joachims, T. (2002). Learning to Classify Text Using Support Vector Machines: Methods, theory and algorithms. USA: Kluwer Academic Publishers.
  31. Kashef, S., Nezamabadi-pour, H., & Nikpour, B. (2018). Multi-label feature selection: a comprehensive review and guiding experiments. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(2), e1240. https://doi.org/10.1002/widm.1240
  32. Kragelj, M. & Kljajic Borstnar, M. (2021). Automatic classification of older electronic texts into the Universal Decimal Classification-UDC. Journal of Documentation, 77(3), 755-776. https://doi.org/10.1108/JD-06-2020-0092
  33. Kumar, V. & Minz, S. (2014). Feature selection: a literature review. Smart Computing Review, 4(3), 211-229. Available: https://faculty.cc.gatech.edu/~hic/CS7616/Papers/Kumar-Minz-2014.pdf
  34. Manning, C., Raghavan, P., & Schutze, H. (2008). Introduction to information retrieval. NY, USA: Cambridge University Press.
  35. Mengle, S. S. R. & Goharian, N. (2009). Ambiguity measure feature-selection algorithm. Journal of the American Society for Information Science & Technology, 60(5), 1037-1050. https://doi.org/10.1002/asi.21023
  36. Mironczuk, M. & Protasiewicz, J. (2018). A recent overview of the state-of-the-art elements of text classification. Expert Systems with Applications, 106, 36-54. https://doi.org/10.1016/j.eswa.2018.03.058
  37. Pereira, R. B., Plastino, A., Zadrozny, B., & Merschmann, L. H. (2018). Correlation analysis of performance measures for multi-label classification. Information Processing & Management, 54(3), 359-369. https://doi.org/10.1016/j.ipm.2018.01.002
  38. Pinheiro, R. H. W., Cavalcanti, G. D. C., & Ren, T. I. (2015). Data-driven global-ranking local feature selection methods for text categorization. Expert Systems with Applications, 42(4), 1941-1949. https://doi.org/10.1016/j.eswa.2014.10.011
  39. Pintas, J. T., Fernandes, L. A. F., & Garcia, A. C. B. (2021). Feature selection methods for text classification: a systematic literature review. Artificial Intelligence Review, 54, 6149-6200. https://doi.org/10.1007/s10462-021-09970-6
  40. Rehman, A., Javed, K., Babri, H. A., & Asim, N. (2018). Selection of the most relevant terms based on a max-min ratio metric for text classification. Expert Systems with Applications, 114, 78-96. https://doi.org/10.1016/j.eswa.2018.07.028
  41. Salton, G. & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 513-523. https://doi.org/10.1016/0306-4573(88)90021-0
  42. Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1), 1-47. https://doi.org/10.1145/505282.505283
  43. Talavera, L. (2005). An evaluation of filter and wrapper methods for feature selection in categorical clustering. In International Symposium on Intelligent Data Analysis. Springer, Berlin, Heidelberg, 440-451. https://doi.org/10.1007/11552253_40
  44. Uysal, A. K. (2016). An improved global feature selection scheme for text classification. Expert Systems with Applications, 43(1), 82-92. https://doi.org/10.1016/j.eswa.2015.08.050
  45. Venkatesh, B. & Anuradha, J. (2019). A review of feature selection and its methods. Cybernetics and Information Technologies, 19(1), 3-26. https://doi.org/10.2478/cait-2019-0001
  46. Wang, D., Zhang, H., Liu, R., Liu, X., & Wang, J. (2016). Unsupervised feature selection through gram-Schmidt orthogonalization-A word co-occurrence perspective. Neurocomputing, 173(P3), 845-854. https://doi.org/10.1016/j.neucom.2015.08.038
  47. Wang, D., Zhang, H., Liu, R., Lv, W., & Wang, D. (2014). t-test feature selection approach based on term frequency for text categorization. Pattern Recognition Letters, 45, 1-10. https://doi.org/10.1016/j.patrec.2014.02.013
  48. Wu, Y. & Zhang, A. (2004). Feature selection for classifying high-dimensional numerical data. Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004, CVPR 2004, 2, 251-258. http://doi.org/10.1109/CVPR.2004.1315171
  49. Yang, Y. & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning, July 08-12, 412-420. Available: http://nyc.lti.cs.cmu.edu/yiming/Publications/yang-icml97.pdf