An Experimental Study on Feature Ranking Schemes for Text Classification

Pan Jun Kim

doi:10.3743/KOSIM.2023.40.1.001

Journal of the Korean Society for Information Management

Vol. 40, No. 1 / pp. 1-21 / 2023 / 1013-0799 (pISSN) / 2586-2073 (eISSN)

Korean Society for Information Management

Pan Jun Kim (Department of Library and Information Science, Silla University)

Received: 2023.02.01
Reviewed: 2023.03.17
Published: 2023.03.30

https://doi.org/10.3743/KOSIM.2023.40.1.001

Abstract

This study examines the performance of feature ranking schemes as an efficient feature selection method for text classification. To date, most feature ranking schemes have been based on document frequency, and relatively few have used term frequency. The study therefore first evaluated single ranking metrics that apply term frequency or document frequency individually, and then evaluated combination ranking schemes that use both. Classification experiments were conducted on two data sets (Reuters-21578, 20NG) with five classifiers (SVM, NB, ROC, TRA, RNN), and 5-fold cross-validation and t-tests were applied to ensure the reliability of the results. Among the single ranking schemes, the document frequency-based metric (chi) showed good performance overall. In addition, no significant performance difference was found between the best-performing single ranking scheme and the combination ranking schemes. Therefore, where sufficient training documents are available, using the document frequency-based single ranking metric (chi) as the feature selection method for text classification is the more efficient choice.
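To make the idea of a document frequency-based ranking metric concrete, the following is a minimal sketch of chi-square feature ranking for one target class. The abstract does not give the exact formulation used in the study, so the standard two-by-two contingency form is assumed here. The function names (`chi_square_scores`, `rank_features`) and the toy documents are hypothetical and serve only as illustration.

```python
from collections import Counter


def chi_square_scores(docs, labels, target):
    """Score each term for one target class from document-frequency counts.

    Assumption: the standard 2x2 chi-square statistic, not necessarily
    the exact variant used in the study.
    """
    n = len(docs)
    n_pos = sum(1 for y in labels if y == target)
    n_neg = n - n_pos
    df_pos = Counter()  # target-class documents containing the term
    df_neg = Counter()  # other-class documents containing the term
    for tokens, y in zip(docs, labels):
        bucket = df_pos if y == target else df_neg
        # set() counts a term once per document: document frequency,
        # not term frequency.
        for term in set(tokens):
            bucket[term] += 1

    scores = {}
    for term in set(df_pos) | set(df_neg):
        a = df_pos[term]   # target docs with the term
        b = df_neg[term]   # other docs with the term
        c = n_pos - a      # target docs without the term
        d = n_neg - b      # other docs without the term
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def rank_features(docs, labels, target, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    scores = chi_square_scores(docs, labels, target)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["oil", "price", "rise"],
    ["oil", "export"],
    ["game", "team", "win"],
    ["team", "price"],
]
labels = ["energy", "energy", "sport", "sport"]

top_terms = rank_features(docs, labels, target="energy", k=2)
print(top_terms)
```

Note that the statistic is symmetric: a term that appears only outside the target class scores as highly as one that appears only inside it, so a ranking of this kind captures strength of association rather than its direction.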

