http://dx.doi.org/10.5392/JKCA.2019.19.11.567

Handling Method of Imbalance Data for Machine Learning: Focused on Sampling

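Lee, Kyunam (Department of Big Data, Chungbuk National University)
Lim, Jongtae (Department of Information and Communication Engineering, Chungbuk National University)
Bok, Kyoungsoo (Department of SW Convergence, Wonkwang University)
Yoo, Jaesoo (Department of Information and Communication Engineering, Chungbuk National University)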
Abstract
Machine learning is increasingly used to solve problems in both academia and industry. It is also being applied to atypical situations such as deviance, fraud detection, and disability detection. In these situations the data are distributed disproportionately across classes, which generally leads to errors, and a variety of attempts have been made to resolve this. In this paper, we propose a method for handling imbalanced data for machine learning. The proposed method addresses the imbalance by verifying that the population distribution of the majority class is well represented in the extracted sample. Performance evaluations show that the proposed method outperforms the existing methods.
Keywords
Imbalance Data; Machine Learning; Under Sampling; Over Sampling; Anomaly Detection;
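
The abstract does not spell out the sampling procedure, so the following is only a rough sketch of the general idea it describes: under-sample the majority class and check that the drawn sample still reflects the distribution of the full majority class. The function name `undersample_majority`, the tolerance `tol`, the mean and standard deviation check, and the toy one-feature data are all assumptions made for illustration; they are not taken from the paper.

```python
import random
import statistics


def undersample_majority(majority, n_keep, tol=0.25, max_tries=100, seed=0):
    """Draw n_keep majority values at random and accept the draw only if its
    mean and standard deviation stay within tol population standard
    deviations of the full majority class (an assumed, simplistic check)."""
    rng = random.Random(seed)
    pop_mean = statistics.fmean(majority)
    pop_std = statistics.pstdev(majority)
    for _ in range(max_tries):
        sample = rng.sample(majority, n_keep)
        mean_ok = abs(statistics.fmean(sample) - pop_mean) <= tol * pop_std
        std_ok = abs(statistics.pstdev(sample) - pop_std) <= tol * pop_std
        if mean_ok and std_ok:
            return sample
    raise ValueError("no representative sample found; relax tol or raise max_tries")


# Toy one-feature data (hypothetical): 1000 majority values, 50 minority values.
data_rng = random.Random(42)
majority = [data_rng.gauss(0.0, 1.0) for _ in range(1000)]
minority = [data_rng.gauss(3.0, 1.0) for _ in range(50)]

# Reduce the majority class to the size of the minority class.
kept = undersample_majority(majority, n_keep=len(minority))

# Balanced training set of (value, label) pairs: 0 = majority, 1 = minority.
balanced = [(x, 0) for x in kept] + [(x, 1) for x in minority]
print(len(kept), len(minority))
```

Comparing only the first two moments is a deliberately simple stand-in for a distribution check; a two-sample test or a per-feature comparison could be substituted without changing the overall structure.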