http://dx.doi.org/10.5762/KAIS.2020.21.1.169

Research on Text Classification of Research Reports using Korea National Science and Technology Standards Classification Codes  

Choi, Jong-Yun (Department of Computer Engineering, Kumoh National Institute of Technology)
Hahn, Hyuk (Korea Institute of Science and Technology Information)
Jung, Yuchul (Department of Computer Engineering, Kumoh National Institute of Technology)
Publication Information
Journal of the Korea Academia-Industrial cooperation Society / v.21, no.1, 2020, pp. 169-177
Abstract
In South Korea, the results of science and technology R&D are submitted to the National Science and Technology Information Service (NTIS) as reports labeled with Korea national science and technology standard classification codes (K-NSCC). However, because the K-NSCC has more than 2,000 sub-categories, choosing the correct classification codes is non-trivial without a clear understanding of the scheme. In addition, little research on automatic document classification has been based on the K-NSCC, and no training data are publicly available. To the best of our knowledge, this study is the first attempt to build a high-performing K-NSCC classification system based on NTIS report meta-information covering five years (2013-2017). To this end, about 210 mid-level categories were selected, and preprocessing was tailored to the characteristics of research report metadata. More specifically, we propose a convolutional neural network (CNN) technique that uses only task names and keywords, the most influential fields. The proposed model is compared with several machine learning methods that perform well in text classification (e.g., the linear support vector classifier, CNN, and gated recurrent unit) and shows a performance advantage of 1% to 7% in top-three F1 score.
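The abstract does not define how the top-three F1 score is computed. The sketch below shows one plausible reading, offered as an assumption and not as the authors' procedure: a report counts as correctly classified when its true category is among the three highest-scoring categories, and on a miss the top-ranked category is charged with a false positive. The function names, the category labels "A" to "D", and the scores are all hypothetical.

```python
from collections import Counter


def top_k_labels(scores, k=3):
    """Return the k labels with the highest scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def top_k_macro_f1(y_true, score_dicts, k=3):
    """Macro-averaged F1 under an ASSUMED top-k rule.

    A document is a true positive for its true category if that category
    is among the k highest-scoring ones. Otherwise the true category gets
    a false negative and the top-1 category gets a false positive.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, scores in zip(y_true, score_dicts):
        ranked = top_k_labels(scores, k)
        if truth in ranked:
            tp[truth] += 1
        else:
            fp[ranked[0]] += 1
            fn[truth] += 1

    # Every label collected here has at least one count, so the
    # denominator below is never zero.
    labels = set(tp) | set(fp) | set(fn)
    f1_values = [
        2 * tp[label] / (2 * tp[label] + fp[label] + fn[label])
        for label in labels
    ]
    return sum(f1_values) / len(f1_values) if f1_values else 0.0


# Hypothetical classifier scores for four reports over four categories.
example_truth = ["A", "B", "C", "A"]
example_scores = [
    {"A": 0.50, "B": 0.30, "C": 0.15, "D": 0.05},
    {"A": 0.40, "C": 0.30, "B": 0.20, "D": 0.10},
    {"A": 0.40, "B": 0.30, "D": 0.20, "C": 0.10},
    {"A": 0.60, "B": 0.20, "C": 0.15, "D": 0.05},
]

top3 = top_k_macro_f1(example_truth, example_scores, k=3)
top1 = top_k_macro_f1(example_truth, example_scores, k=1)
```

In this toy data the second report's true category is ranked third, so it counts as a hit at k=3 but as a miss at k=1. This illustrates why a top-three score suits a scheme with hundreds of closely related categories, where the correct code is often near the top without being ranked first.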
Keywords
Deep Learning; Text Classification; Research Report; Preprocessing; NTIS