Browse > Article
http://dx.doi.org/10.3743/KOSIM.2019.36.2.057

An Analytical Study on Automatic Classification of Domestic Journal articles Using Random Forest  

Kim, Pan Jun (신라대학교 문헌정보학과)
Publication Information
Journal of the Korean Society for information Management / v.36, no.2, 2019 , pp. 57-77 More about this Journal
Abstract
Random Forest (RF), a representative ensemble technique, was applied to automatic classification of journal articles in the field of library and information science. Especially, I performed various experiments on the main factors such as tree number, feature selection, and learning set size in terms of classification performance that automatically assigns class labels to domestic journals. Through this, I explored ways to optimize the performance of random forests (RF) for imbalanced datasets in real environments. Consequently, for the automatic classification of domestic journal articles, Random Forest (RF) can be expected to have the best classification performance when using tree number interval 100~1000(C), small feature set (10%) based on chi-square statistic (CHI), and most learning sets (9-10 years).
Keywords
automatic classification; automatic annotation; digital curation; journal articles; random forest (RF); multi-label classification; imbalanced data; feature selection;
Citations & Related Records
Times Cited By KSCI : 7  (Citation Analysis)
연도 인용수 순위
1 Kim, S., & Ahn, H. (2016). Application of Random Forests to corporate credit rating prediction. The Journal of Business and Economics, 32(1), 187-211.
2 Kim, P. J. (2006). A Study on automatic assignment of descriptors using machine learning. Journal of the Korean Society for information Management, 23(1), 279-299. https://doi.org/10.3743/KOSIM.2006.23.1.279   DOI
3 Kim, Pan Jun (2016). An analytical study on performance factors of automatic classification based on machine learning. Journal of the Korean Society for information Management, 33(2), 33-59. http://dx.doi.org/10.3743/KOSIM.2016.33.2.033   DOI
4 Lee, H., Shin, D., Park, H., Kim, S., & Shin, D. (2011). Research on the modified algorithm for improving accuracy of Random Forest classifier which identifies automatically arrhythmia. The KIPS Transactions: Part B, 18(6), 341-348.
5 Jeong, S., Choi, M., & Kim, H. (2016). Coreference resolution for Korean using Random Forests. Journal of KIISE, 5(11), 535-540.
6 Jeong, J., Jang, K., & Kim, J. (2016). Target classification method using Random Forest and genetic algorithm. 2016 IEIE Fall Conference, 601-604.
7 Jo, H., & Park, C. (2018). Analysis of reporting characteristics of newspapers in the 19th presidential election based on random forest. Journal of the Korean data & information science society, 29(2), 367-375. http://dx.doi.org/10.7465/jkdi.2018.29.2.367   DOI
8 Choi, H., Choi, S., & Han, K. (2012). Prediction of DNA binding sites in proteins using a Random Forest. Journal of KIISE, 39(7), 515-522.
9 Hong, J., Ko, B., & Nam, J. (2013). Human action recognition in still image using weighted bag-of-features and ensemble decision trees. The Journal of Korean Institute of Communications and Information Sciences, 38(1), 1-9. https://doi.org/10.7840/kics.2013.38A.1.1   DOI
10 Afianto, M. F., Adiwijaya, & Al-Faraby, S. (2017). Text categorization on Hadith Sahih Al-Bukhari using Random Forest, International Conference on Data and Information Science, IOP Conference Series: Journal of Physics: Conf. Series 971. http://doi.org/10.1088/1742-6596/971/1/012037
11 Amaratunga, D., Cabrera, J., & Lee, Y. (2008). Enriched random forests. Bioinformatics, 24(18), 2010-2014. https://doi.org/10.1093/bioinformatics/btn356   DOI
12 Low, F., Schorcht, G., Michel, U., Dech, S., & Conrad, C. (2012). Per-field crop classification in irrigated agricultural regions in middle Asia using random forest and support vector machine ensemble. Proc. SPIE 8538, Earth Resources and Environmental Remote Sensing/GIS Applications III, 85380R (25 October 2012). http://doi.org/10.1117/12.974588
13 Lee, Jaesung, & Kim, Dae-Won (2015). Mutual information-based multi-label feature selection using interaction information. Expert Systems with Applications, 42(4), 2013-2025. https://doi.org/10.1016/j.eswa.2014.09.063   DOI
14 Liparas D., HaCohen-Kerner Y., Moumtzidou A., Vrochidis S., & Kompatsiaris I. (2014). News articles classification using Random Forests and weighted multimodal features. In: Lamas D., Buitelaar P. (eds) Multidisciplinary Information Retrieval. IRFC 2014. Lecture Notes in Computer Science, vol 8849. Springer, Cham. https://doi.org/10.1007/978-3-319-12979-2_6
15 Lok, C. (2010). Speed reading. Nature 463, 28. http://doi.org/10.1038/463416a
16 Ma, L. (2017). A multi-label text classification framework: Using supervised and unsupervised feature selection strategy. Unpublished doctoral dissertation, Georgia State University. retrieved from https://scholarworks.gsu.edu/cs_diss/134
17 Ma, L., & Fan, S. (2017). CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests. BMC Bioinformatics, 18(1), 169. https://doi.org/10.1186/s12859-017-1578-z   DOI
18 Ma, L., Zhang, Y., Sunderraman, R., Fox, P., Laird, A., Turner, J., & Turner, M. (2015). Hybrid feature selection methods for online biomedical publication classification. 2015 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology, Canada, 1-8. https://doi.org/10.1109/CIBCB.2015.7300320
19 Kwon, A. (2013). Variable selection using Random Forest. unpublished master's thesis, Inha University.
20 Kang, S., Jeon, H., Kim, J., & Song, J. (2015). A study on domestic drama rating prediction. The Korean Journal of Applied Statistics, 28(5), 933-949. http://dx.doi.org/10.5351/KJAS.2015.28.5.933   DOI
21 Yoo, J. (2015). Random forests, an alternative data mining technique to decision tree. Journal of Educational Evaluation, 28(2), 427-448.
22 Kim, Pan Jun (2018). An analytical study on automatic classification of domestic journal articles based on machine learning. Journal of the Korean Society for information Management, 35(2), 37-62. https://doi.org/10.3743/KOSIM.2018.35.2.037   DOI
23 Nam, S., Oh, M., Kim, S., Kang, C., Kim, G., & Choi, S. (2017). Comparison of machine learning models for classification into user-oriented groups. Journal of the Korean Data Analysis Society, 19(5), 2501-1507.   DOI
24 Suh, J. (2016). Foreign exchange rate forecasting using the GARCH extended Random Forest model. Journal of Industrial Economics and Business, 29(5), 1607-1628.
25 Yun, Taegyun, & Yi, Gwan-Su (2008). Application of Random Forest algorithm for the decision support system of medical diagnosis with the selection of significant clinical test. The Transactions of the Korean Institute of Electrical Engineers, 57(6), 1058-1062.
26 Lee, Jae-Yun (2005). An empirical study on improving the performance of text categorization considering the relationships between feature selection criteria and weighting methods. Journal of the Korean Society for Library and Information Science, 39(2), 123-146. http://dx.doi.org/10.4275/kslis.2005.39.2.123   DOI
27 Kim, M. J., Kang, D. K., &. Kim, H. B. (2015). Geometric mean based boosting algorithm with over-sampling to resolve data imbalance problem for bankruptcy prediction. Expert Systems with Applications, 42(3), 1074-1082. https://doi.org/10.1016/j.eswa.2014.08.025   DOI
28 Dogan, T., & Uysal, A. K. (2018). The impact of feature selection on urban land cover classification. International Journal of Intelligent Systems and Applications in Engineering(IJISAE), 6(1), 59-64. http://doi.org/10.18201/ijisae.2018637933   DOI
29 Fawagreh, K., Gaber, M. M., & Elyan, E. (2014). Random forests: From early developments to recent advancements. Systems Science & Control Engineering, 2(1), 602-609. http://doi.org/10.1080/21642583.2014.956265   DOI
30 Gao, D., Zhang, Y., & Zhao, Y. (2009). Random forest algorithm for classification of multiwavelength data. Research in Astronomy and Astrophysics, 9(2), 220-226. http://doi.org/101088/1674-4527/9/2/011   DOI
31 Klassen, M., & Paturi, N. (2010). Web document classification by keywords using Random Forests. In: Zavoral F., Yaghob J., Pichappan P., El-Qawasmeh E. (eds) Networked Digital Technologies. NDT 2010. Communications in Computer and Information Science, vol 88. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14306-9_26
32 Kong, Q., Gong, H., Ding, X., & Hou, R. (2017). Classification application based on mutual information and Random Forest method for high dimensional data. 9th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC), Hangzhou, 171-174. https://doi.org/10.1109/IHMSC.2017.45
33 Brandenburg, Minke (2017). Text classification of Dutch police records. Unpublished master's thesis, Utrecht University Artificial Intelligence, Netherlands.
34 Madjarov, G., Kocev, D., Gjorgjevikj, D., & Dzeroski, S. (2012). An extensive experimental comparison of methods for multi-label learning. Pattern Recognition, 45, 3084-3104. https://doi.org/10.1016/j.patcog.2012.03.004   DOI
35 Aung, W. T., Myanmar, Y., & Hla, K. H. M. S. (2009). Random forest classifier for multi-category classification of web pages. In Services Computing Conference, APSCC 2009. IEEE Asia-Pacific, 372-376. http://doi.org/10.1109/APSCC.2009.5394100
36 Austin, P. C., Tu, J. V., Ho, J. E., Levy, D., & Lee, D. S. (2013). Using methods from the data-mining and machine-learning literature for disease classification and prediction: A case study examining classification of heart failure subtypes. Journal of Clinical Epidemiology, 66(4), 398-407. http://doi.org/10.1016/j.jclinepi.2012.11.008   DOI
37 Berk, R., Li, A., & Hickman, L. J. (2005). Statistical difficulties in determining the role of race in capital cases: A re-analysis of data from the state of Maryland. Journal of Quantitative Criminology, 21(4), 365-390. https://doi.org/10.1007/s10940-005-7354-7   DOI
38 Boinee, P., Angelis, A. D., & Foresti, G. L. (2005). Meta random forests. International Journal of Computational Intelligence, 2(3), 138-147.
39 Brown, I., & Mues, C. (2012). An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Systems with Applications, 39(3), 3446-3453. http://doi.org/10.1016/j.eswa.2011.09.033   DOI
40 Choi, S., & Kim, H. (2016). Tree size determination for classification ensemble. Journal of the Korean Data &Information Science Society, 27(1), 255-264. http://dx.doi.org/10.7465/jkdi.2016.27.1.255   DOI
41 Cutler, D. R., Edwards, T. C., Beard, K. H., Cutler, A., Hess, K. T., Gibson, J., & Lawler, J. J. (2007). Random forests for classification in ecology. Ecology, 88(11), 2783-2792. http://doi.org/10.1890/07-0539.1   DOI
42 Trieschnigg, D., Pezik, P., Lee, V., Jong, F. D., Kraaij, W., & Rebholz-Schuhmann, D. (2009). MeSH Up: Effective MeSH text classification for improved document retrieval. Bioinformatics, 25(11), 1412-1418. https://doi.org/10.1093/bioinformatics/btp249   DOI
43 Latinne, P., Debeir, O., & Decaestecker, C. (2001). Limiting the number of trees in Random Forests. In: Kittler J., Roli F. (eds) Multiple Classifier Systems. MCS 2001. Lecture Notes in Computer Science, vol 2096. Springer, Berlin, Heidelberg, 178-187. https://doi.org/10.1007/3-540-48219-9_18
44 Nayak, S., Ramesh, R., & Shah, S. (2013). A study of multi-label text classification and the effect of label hierarchy. CS224N Project Report, USA: Stanford University. retrieved from https://nlp.stanford.edu/courses/cs224n/2013/reports/nayak.pdf
45 Robnik-Sikonja M. (2004). Improving Random Forests. In: Boulicaut JF., Esposito F., Giannotti F., Pedreschi D. (eds) Machine Learning: ECML 2004. ECML 2004. Lecture Notes in Computer Science, vol 3201. Springer, Berlin. https://doi.org/10.1007/978-3-540-30115-8_34
46 Roul, R. K., & Rai, P. (2016). A new feature selection technique combined with elm feature space for text classification. In Proceedings of the 13th International Conference on Natural Language Processing, 285-292.
47 Siroky, D. S. (2009). Navigating random forests and related advances in algorithmic modeling. Statistics Surveys, 3, 147-163.   DOI
48 Turner, M. D., Chakrabarti, C., Jones, T. B., Xu, J. F., Fox, P. T., Luger, G. F., Laird, A. R., & Turner, J. A. (2013). Automated annotation of functional imaging experiments via multi-label classification. Frontiers in neuroscience, 7, 240. http://doi.org/10.3389/fnins.2013.00240   DOI
49 Lee, C., Yoo, K., Mun, B., & Bae, S. (2017). Informal quality data analysis via sentimental analysis and Word2vec method. Journal of Korean Society for Quality Management, 45(1), 117-127. http://dx.doi.org/10.7469/JKSQM.2017.45.1.117   DOI
50 Tsymbal, A., Pechenizkiy, M., & Cunningham, P. (2006) Dynamic integration with random forests. In: Furnkranz, J., Scheffer, T., Spiliopoulou, M. (eds) Machine Learning: ECML 2006. ECML 2006. Lecture Notes in Computer Science, vol 4212. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11871842_82
51 Manning, Christopher, Raghavan, & Prabhakar (2008). Introduction to information retrieval. NY, USA: Cambridge University Press.
52 Yao, D., Yang, J., & Zhan, X. (2013). An improved random forest algorithm for class-imbalanced data classification and its application in PAD risk factors analysis. The Open Electrical & Electronic Engineering Journal, 7, (Supple 1: M7) 62-70. http://dx.doi.org/10.2174/1874129001307010062   DOI
53 Ward, M. S., Pajevic, J., Dreyfuss, J., & Malley, J. (2006). Short-term prediction of mortality in patients with systemic lupus erythematosus: Classification of outcomes using random forests. Arthritis and Rheumatism, 55(1), 74-80. http:/doi.org/10.1002/art.21695   DOI
54 Wu, Q., Ye, Y., Zhang, H., Ng, M. K., & Ho, Shen-Shyang. (2014). Fores texter: An efficient random forest algorithm for imbalanced text categorization. Knowledge-Based System, 67, 105-116. http://doi.org/10.1016/j.knosys.2014.06.004   DOI
55 Xu, B., Guo, X., Ye, Y., & Cheng, J. (2012). An improved random forest classifier for text categorization. Journal of Computers, 7(12), 2913-2920. http://dx.doi.org/10.4304/jcp.7.12.2913-2920.
56 Xu, B., Huang, J. Z., Williams, G., & Ye, Y. (2012). Hybrid weighted random forests for classifying very high dimensional data. International Journal of Data Warehousing and Mining, 8(2), 44-63. http://dx.doi.org/10.4018/jdwm.2012040103   DOI
57 Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning, July 08-12, 412-420.
58 Zhou Q., Zhou H., & Li, T. (2016). Cost-sensitive feature selection using random forest: Selecting low-cost subsets of informative features. Knowledge-Based Systems, 95, 1-11. https://doi.org/10.1016/j.knosys.2015.11.010   DOI
59 Breiman L. (2002). Random forests. Machine Learning, 45(1), 5-32.   DOI