http://dx.doi.org/10.3743/KOSIM.2018.35.2.037

An Analytical Study on Automatic Classification of Domestic Journal Articles Based on Machine Learning

Kim, Pan Jun (Department of Library and Information Science, Silla University)
Publication Information
Journal of the Korean Society for Information Management / v.35, no.2, 2018, pp. 37-62
Abstract
This study examined the factors affecting the performance of machine-learning-based automatic classification of domestic journal articles in the field of library and information science (LIS). In particular, focusing on how well class labels can be assigned automatically to articles in the "Journal of the Korean Society for Information Management", I investigated the characteristics of the key factors (weighting schemes, training set size, classification algorithms, and label assignment methods) through a range of experiments. The results show that it is effective to apply each factor appropriately according to the classification environment and the characteristics of the document set, and that fairly good performance can be obtained with a simpler model. In addition, the classification of domestic journal articles can be treated as multi-label classification, in which more than one category is assigned to a given article. Therefore, with this environment in mind, I proposed an optimal classification model that uses a simple and fast classification algorithm and a small training set.
Keywords
automatic classification; text categorization; performance factors; journal articles; Rocchio; SVM (Support Vector Machine); NB (Naive Bayes); single-label classification; multi-label classification; machine learning
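To make the multi-label setting described in the abstract concrete, the following is a minimal sketch of a Rocchio-style centroid classifier over TF-IDF weights. It is not the model evaluated in the article: the function names, the toy tokens and labels, the positive-only centroid variant (no negative-example term), and the rule of keeping every label whose score is at least `ratio` times the best score are all illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def normalize(vec):
    """Scale a sparse term->weight dict to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def fit_idf(docs):
    """IDF computed as log(N / df) + 1 from tokenized training documents."""
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log(len(docs) / n) + 1.0 for t, n in df.items()}


def tfidf(tokens, idf):
    """Unit-length TF-IDF vector; terms unseen in training are dropped."""
    tf = Counter(tokens)
    return normalize({t: c * idf[t] for t, c in tf.items() if t in idf})


def train_centroids(docs, label_sets, idf):
    """One unit-length centroid per label; a document counts for each of its labels."""
    sums = defaultdict(lambda: defaultdict(float))
    for tokens, labels in zip(docs, label_sets):
        vec = tfidf(tokens, idf)
        for label in labels:
            for t, w in vec.items():
                sums[label][t] += w
    return {label: normalize(vec) for label, vec in sums.items()}


def classify(tokens, idf, centroids, ratio=0.8):
    """Return every label whose cosine score is within `ratio` of the best score.

    The relative threshold is an assumed label-assignment rule, chosen only
    to show how one article can receive more than one category.
    """
    vec = tfidf(tokens, idf)
    scores = {label: sum(w * c.get(t, 0.0) for t, w in vec.items())
              for label, c in centroids.items()}
    best = max(scores.values(), default=0.0)
    if best <= 0.0:
        return []
    return sorted(label for label, s in scores.items() if s >= ratio * best)


# Hypothetical toy training set: tokenized articles with one or more labels each.
train_docs = [
    ["query", "retrieval", "ranking", "index"],
    ["retrieval", "relevance", "query", "evaluation"],
    ["citation", "journal", "impact", "bibliometrics"],
    ["citation", "coauthor", "network", "journal"],
    ["retrieval", "citation", "ranking", "journal"],
]
train_labels = [{"IR"}, {"IR"}, {"Bibliometrics"}, {"Bibliometrics"},
                {"IR", "Bibliometrics"}]

idf = fit_idf(train_docs)
centroids = train_centroids(train_docs, train_labels, idf)

print(classify(["query", "retrieval", "relevance"], idf, centroids))
print(classify(["citation", "ranking", "retrieval", "journal"], idf, centroids))
```

Lowering `ratio` assigns more labels per article and raising it toward 1.0 approaches single-label assignment, which is one simple way to explore the trade-off between the single-label and multi-label settings named in the keywords.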