http://dx.doi.org/10.3837/tiis.2018.08.010

A Novel Feature Selection Method in the Categorization of Imbalanced Textual Data  

Pouramini, Jafar (Department of Computer and Information Technology Engineering, Faculty of Engineering, University of Qom)
Minaei-Bidgoli, Behrouze (Faculty of Computer Engineering, Iran University of Science and Technology)
Esmaeili, Mahdi (Faculty of Computer Engineering, Kashan Islamic Azad University)
Publication Information
KSII Transactions on Internet and Information Systems (TIIS) / v.12, no.8, 2018, pp. 3725-3748
Abstract
Text data is often imbalanced across classes. Class imbalance is one of the main challenges in text classification because it degrades classifier performance. Many studies have addressed this problem, and the proposed solutions fall into several general categories, including sampling-based and algorithm-based methods. Recent studies have also considered feature selection as a solution to the imbalance problem. In this paper, a novel one-sided feature selection method, called probabilistic feature selection (PFS), is presented for imbalanced text classification. PFS is a probabilistic method whose score is calculated from the feature distribution. Compared to similar methods, PFS has more parameters. To evaluate the proposed method, the feature selection methods Gini, MI, FAST, and DFS were also implemented for comparison, and the C4.5 decision tree and Naive Bayes classifiers were used. Experimental results on the Reuters-21578 and WebKB datasets, measured by F-measure, suggest that the proposed feature selection method significantly improves classifier performance.
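The abstract reports results by F-measure rather than accuracy. The short Python sketch below is not taken from the paper; it illustrates, on a made-up 95/5 class split, why accuracy can look strong on imbalanced data while the F-measure of the minority class exposes a useless classifier. The function names, the label encoding (1 for the minority class), and the toy predictions are all assumptions chosen for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f_measure(y_true, y_pred, positive=1, beta=1.0):
    """F-beta score for the class labelled `positive` (beta=1 gives F1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical toy data: 95 majority-class documents (0), 5 minority-class (1).
y_true = [0] * 95 + [1] * 5

# Classifier A (hypothetical) always predicts the majority class.
pred_a = [0] * 100

# Classifier B (hypothetical) raises 2 false alarms and finds 4 of the 5
# minority documents.
pred_b = [1] * 2 + [0] * 93 + [1] * 4 + [0] * 1

print("A: accuracy", accuracy(y_true, pred_a), "F1", f_measure(y_true, pred_a))
print("B: accuracy", accuracy(y_true, pred_b), "F1", f_measure(y_true, pred_b))
```

In this toy case classifier A reaches 0.95 accuracy with an F1 of 0 on the minority class, while classifier B reaches 0.97 accuracy with an F1 of 8/11 (about 0.73), so the two measures rank the classifiers very differently in how far apart they are.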
Keywords
Feature selection; Imbalanced class; High dimensionality; Text classification