A Novel Feature Selection Method in the Categorization of Imbalanced Textual Data

  • Pouramini, Jafar (Department of Computer and Information Technology Engineering, Faculty of Engineering, University of Qom)
  • Minaei-Bidgoli, Behrouze (Faculty of Computer Engineering, Iran University of Science and Technology)
  • Esmaeili, Mahdi (Faculty of Computer Engineering, Kashan Islamic Azad University)
  • Received : 2017.07.24
  • Accepted : 2018.02.17
  • Published : 2018.08.31

Abstract

The distribution of text data is often imbalanced. Class imbalance is a major challenge in text classification because it degrades classifier performance. Many studies have addressed this problem, and the proposed solutions fall into several general categories, including sampling-based and algorithm-based methods. Recent studies have also considered feature selection as a solution to the imbalance problem. In this paper, a novel one-sided feature selection method, probabilistic feature selection (PFS), is presented for imbalanced text classification. PFS is a probabilistic method whose score is calculated from the feature distribution. Compared to similar methods, PFS has more parameters. To evaluate the proposed method, the feature selection methods Gini, MI, FAST, and DFS were also implemented, and the C4.5 decision tree and Naive Bayes classifiers were used. The F-measure results on the Reuters-21578 and WebKB datasets suggest that the proposed feature selection method significantly improves classifier performance.
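The abstract does not give the PFS formula, so it is not reproduced here. As a rough illustration of the evaluation setup it describes, the sketch below scores terms with a Gini-style metric computed from the feature distribution, assuming the commonly cited form sum over classes of P(t|c)^2 * P(c|t)^2, and computes the F-measure for a chosen class. The function names, the toy documents, and the class labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def gini_text_scores(docs, labels):
    """Score each term t with sum_c P(t|c)^2 * P(c|t)^2 (assumed Gini-style form).

    Probabilities are estimated from document frequencies:
    P(t|c) = docs of class c containing t / docs of class c
    P(c|t) = docs of class c containing t / docs containing t
    """
    class_sizes = Counter(labels)
    df_in_class = {}  # term -> Counter(class -> number of docs containing term)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):  # count each term once per document
            df_in_class.setdefault(term, Counter())[label] += 1

    scores = {}
    for term, per_class in df_in_class.items():
        total_df = sum(per_class.values())
        scores[term] = sum(
            (n / class_sizes[c]) ** 2 * (n / total_df) ** 2
            for c, n in per_class.items()
        )
    return scores


def f_measure(y_true, y_pred, positive):
    """F-measure (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy corpus: "grain" is the minority class (2 of 6 documents).
docs = [
    ["wheat", "export"], ["wheat", "price"],
    ["oil", "price"], ["oil", "export"], ["oil", "market"], ["oil", "price"],
]
labels = ["grain", "grain", "crude", "crude", "crude", "crude"]

scores = gini_text_scores(docs, labels)
top_terms = sorted(scores, key=scores.get, reverse=True)[:2]

# A classifier that always predicts the majority class is right on 4 of 6
# documents, yet its F-measure on the minority class is zero.
majority_only = ["crude"] * len(labels)
minority_f = f_measure(labels, majority_only, positive="grain")
```

The last lines hint at why the abstract reports F-measure rather than accuracy: on imbalanced data a majority-only predictor can look acceptable by accuracy while failing entirely on the minority class.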
