Re-SSS: Rebalancing Imbalanced Data Using Safe Sample Screening

  • Shi, Hongbo (School of Information, Shanxi University of Finance and Economics)
  • Chen, Xin (School of Information, Shanxi University of Finance and Economics)
  • Guo, Min (School of Information, Shanxi University of Finance and Economics)
  • Received : 2019.12.26
  • Accepted : 2020.06.07
  • Published : 2021.02.28

Abstract

Different samples can affect the learning of support vector machine (SVM) classifiers differently. To rebalance an imbalanced dataset, it is therefore reasonable to remove non-informative samples and add informative ones. Safe sample screening can identify some of the non-informative samples while retaining the informative samples. This study developed a resampling algorithm, Rebalancing imbalanced data using Safe Sample Screening (Re-SSS), which consists of two parts: selecting Informative Samples (Re-SSS-IS) and rebalancing via a Weighted SMOTE (Re-SSS-WSMOTE). Re-SSS-IS selects informative samples from the majority class and determines a suitable regularization parameter for the SVM, while Re-SSS-WSMOTE generates informative minority samples. Both Re-SSS-IS and Re-SSS-WSMOTE are based on safe sample screening. The experimental results show that Re-SSS can effectively improve classification performance on imbalanced classification problems.
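
To make the idea of a weighted SMOTE concrete, the following is a minimal sketch, not the Re-SSS-WSMOTE procedure itself. It assumes each minority sample already has a non-negative informativeness weight. How Re-SSS derives such weights from safe sample screening is not described in the abstract, so `weights` is a hypothetical input here, and the function name `weighted_smote` and the parameters `n_new`, `k`, and `seed` are illustrative only.

```python
import math
import random


def weighted_smote(minority, weights, n_new, k=3, seed=0):
    """Generate n_new synthetic minority points by SMOTE-style interpolation.

    minority: list of feature vectors (lists of floats) from the minority class.
    weights:  one non-negative score per minority sample (hypothetical input;
              a higher score means the sample is chosen as a seed more often).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    if len(weights) != len(minority):
        raise ValueError("need exactly one weight per minority sample")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    indices = range(len(minority))
    synthetic = []
    for _ in range(n_new):
        # Draw the seed in proportion to its weight, so that samples marked
        # as informative spawn more synthetic points.
        i = rng.choices(indices, weights=weights, k=1)[0]
        base = minority[i]
        # Find the k nearest minority neighbours of the seed (excluding itself).
        others = [j for j in indices if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # Standard SMOTE step: a random point on the segment seed -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Usage: the last sample gets weight 0, so it is never used as a seed.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
weights = [1.0, 1.0, 1.0, 0.0]
new_points = weighted_smote(minority, weights, n_new=6, k=2)
```

In this toy example every synthetic point falls inside the triangle spanned by the first three samples, because the zero-weight outlier is never a seed and is never among the two nearest neighbours of the other samples. Plain SMOTE corresponds to the special case of equal weights.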

Acknowledgement

This work was supported by the National Natural Science Foundation of China (No. 61801279), the Key Research and Development Project of Shanxi Province (No. 201903D121160), and the Natural Science Foundation of Shanxi Province (Nos. 201801D121115 and 201901D111318).
