http://dx.doi.org/10.3743/KOSIM.2018.35.4.141

Semi-automatic Construction of Learning Set and Integration of Automatic Classification for Academic Literature in Technical Sciences  

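Kim, Seon-Wu (Department of Library and Information Science, Kyonggi University)
Ko, Gun-Woo (Department of Library and Information Science, Kyonggi University)
Choi, Won-Jun (Content Curation Center, Korea Institute of Science and Technology Information)
Jeong, Hee-Seok (Content Curation Center, Korea Institute of Science and Technology Information)
Yoon, Hwa-Mook (Content Curation Center, Korea Institute of Science and Technology Information)
Choi, Sung-Pil (Department of Library and Information Science, Kyonggi University)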
Publication Information
Journal of the Korean Society for Information Management / v.35, no.4, 2018, pp. 141-164
Abstract
In recent years, the volume of academic literature has grown rapidly and complex research has become more active, making it difficult for researchers to analyze trends in prior work. Addressing this problem requires classification information at the level of individual papers, but no academic database in Korea currently provides it. In this paper, we propose an automatic classification system that assigns domestic academic literature to multiple classes. To this end, we first collected Korean-language academic documents in the technical sciences and mapped them to class 600 of the DDC using the K-Means clustering technique, constructing a training set that supports classification into multiple classes. The resulting training set contains 63,915 Korean technical-science documents, excluding records with missing metadata. Using this training set, we implemented and trained a deep-learning-based automatic classification engine for academic documents. Experiments on a manually built test set showed 78.32% accuracy and a 72.45% F1 score for classification into multiple classes.
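The abstract does not say how K-Means clusters are mapped to DDC classes, so the following is only a minimal sketch of one plausible semi-automatic scheme, not the authors' implementation. It assumes that each DDC 600 division is given a few seed keywords, that the seeds initialize the centroids, and that every document inherits the label of the cluster it ends up in. The toy documents, the seed keywords, and the helper names `vectorize`, `sq_dist`, and `kmeans` are all hypothetical.

```python
import math
from collections import Counter

def vectorize(tokens, vocab):
    """L2-normalized bag-of-words vector over a fixed vocabulary."""
    counts = Counter(tokens)
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, init_centroids, iters=20):
    """Plain K-Means with caller-supplied initial centroids.

    Returns the cluster index assigned to each input vector.
    """
    centroids = [list(c) for c in init_centroids]
    assign = []
    for _ in range(iters):
        new_assign = [
            min(range(len(centroids)), key=lambda k: sq_dist(v, centroids[k]))
            for v in vectors
        ]
        if new_assign == assign:  # converged: assignments stopped changing
            break
        assign = new_assign
        for k in range(len(centroids)):
            members = [v for v, a in zip(vectors, assign) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assign

# Hypothetical toy corpus (already tokenized) and assumed seed keywords
# for three DDC 600 divisions: 610 medicine, 620 engineering, 630 agriculture.
docs = [
    ["patient", "clinical", "therapy"],
    ["therapy", "drug", "patient"],
    ["engine", "turbine", "vibration"],
    ["vibration", "structure", "engine"],
    ["crop", "soil", "yield"],
    ["soil", "fertilizer", "crop"],
]
seeds = {
    "610": ["patient", "therapy"],
    "620": ["engine", "vibration"],
    "630": ["crop", "soil"],
}

vocab = sorted({w for d in docs for w in d})
doc_vecs = [vectorize(d, vocab) for d in docs]
seed_vecs = [vectorize(words, vocab) for words in seeds.values()]

ddc_codes = list(seeds)  # cluster index -> DDC code (dict order is preserved)
labels = [ddc_codes[k] for k in kmeans(doc_vecs, seed_vecs)]
print(labels)
```

In a real pipeline the cluster labels would still need manual review before being used as training data, which is what makes the construction semi-automatic rather than fully automatic.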
Keywords
automatic classification; text mining; NLP (Natural Language Processing); deep learning; semi-supervised learning
Citations & Related Records
Times Cited By KSCI: 7