http://dx.doi.org/10.3837/tiis.2018.08.010

A Novel Feature Selection Method in the Categorization of Imbalanced Textual Data

Pouramini, Jafar (Department of Computer and Information Technology Engineering, Faculty of Engineering, University of Qom)
Minaei-Bidgoli, Behrouze (Faculty of Computer Engineering, Iran University of Science and Technology)
Esmaeili, Mahdi (Faculty of Computer Engineering, Kashan Islamic Azad University)

Publication Information

KSII Transactions on Internet and Information Systems (TIIS) / v.12, no.8, 2018, pp. 3725-3748

Abstract

Text data distribution is often imbalanced. Imbalanced data is one of the challenges in text classification because it degrades classifier performance. Many studies have addressed this problem, and the proposed solutions fall into several general categories, including sampling-based and algorithm-based methods. Recent studies have also considered feature selection as a solution to the imbalance problem. This paper presents a novel one-sided feature selection method, probabilistic feature selection (PFS), for imbalanced text classification. PFS is a probabilistic method calculated from the feature distribution. Compared to similar methods, PFS has more parameters. To evaluate the proposed method, the feature selection methods Gini, MI, FAST and DFS were also implemented, and the C4.5 decision tree and Naive Bayes classifiers were used. Test results on Reuters-21578 and WebKB, measured by F-measure, suggest that the proposed feature selection method significantly improves classifier performance.
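
As a rough illustration of what a one-sided filter does, the sketch below ranks terms for a minority class and keeps only those that indicate membership in it. The score used here (the difference between a term's document frequency in the target class and in all other classes) is an assumption chosen for simplicity; it is not the PFS formula, which the abstract does not give. The function names and the toy corpus are likewise hypothetical.

```python
from collections import Counter


def one_sided_scores(docs, labels, target):
    """Score each term by P(term | target) - P(term | other classes).

    Illustrative stand-in score only; NOT the PFS metric from the paper.
    Probabilities are estimated as per-class document frequencies.
    """
    pos = [set(d.split()) for d, y in zip(docs, labels) if y == target]
    neg = [set(d.split()) for d, y in zip(docs, labels) if y != target]
    pos_df = Counter(t for d in pos for t in d)
    neg_df = Counter(t for d in neg for t in d)
    n_pos = max(len(pos), 1)  # guard against an empty class
    n_neg = max(len(neg), 1)
    vocab = set(pos_df) | set(neg_df)
    return {t: pos_df[t] / n_pos - neg_df[t] / n_neg for t in vocab}


def select_top_k(scores, k):
    """Keep the k best terms, dropping any that do not favor the target class."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, score in ranked[:k] if score > 0]


# Toy imbalanced corpus (hypothetical): 2 minority documents, 4 majority documents.
docs = [
    "wheat crop export",
    "wheat price rise",
    "stock market rise",
    "stock price fall",
    "market share fall",
    "bank stock market",
]
labels = ["grain", "grain", "other", "other", "other", "other"]

scores = one_sided_scores(docs, labels, "grain")
top = select_top_k(scores, 2)
print(top)
```

Because only positively scored terms are kept, terms that are frequent in the majority class (such as "stock" here) are never selected for the minority class, which is the behavior that distinguishes a one-sided metric from a two-sided one.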

Keywords

Feature selection; Imbalanced class; High dimensionality; Text classification
