http://dx.doi.org/10.13088/jiis.2022.28.3.023

Class Imbalance Resolution Method and Classification Algorithm Suggesting Based on Dataset Type Segmentation  

Kim, Jeonghun (Graduate School of Business IT, Kookmin University)
Kwahk, Kee-Young (College of Business Administration/Graduate School of Business IT, Kookmin University)
Publication Information
Journal of Intelligence and Information Systems / v.28, no.3, 2022, pp. 23-43
Abstract
As artificial intelligence (AI) is applied across more industries, interest in algorithm selection is growing. Algorithm selection depends largely on the experience of the data scientist. An inexperienced data scientist can instead select an algorithm through meta-learning based on dataset characteristics. Because that selection process is a black box, however, it has not been possible to know the basis on which an existing algorithm recommendation was derived. This study therefore uses k-means cluster analysis to group datasets into types according to their characteristics, and explores the classification algorithms and class imbalance resolution methods suited to each type. Four dataset types were derived, and an appropriate class imbalance resolution method and classification algorithm were recommended for each.
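The type-segmentation step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the three meta-features (log sample count, log feature count, imbalance ratio), the synthetic datasets, and the `meta_features` helper are all assumptions chosen for brevity. Only the use of k-means and the choice of four clusters come from the abstract.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler


def meta_features(X, y):
    """Describe one dataset with a few illustrative characteristics.

    These three features are assumptions for this sketch, not the
    characteristics used in the paper.
    """
    n_samples, n_features = X.shape
    counts = np.bincount(y)
    counts = counts[counts > 0]
    imbalance_ratio = counts.max() / counts.min()  # majority size / minority size
    return [np.log10(n_samples), np.log10(n_features), imbalance_ratio]


# Build a small synthetic collection of binary datasets that vary in
# size, dimensionality, and degree of class imbalance.
rng = np.random.default_rng(0)
rows = []
for i in range(40):
    n = int(rng.integers(200, 5000))
    d = int(rng.integers(5, 50))
    minority = float(rng.uniform(0.05, 0.5))
    X, y = make_classification(
        n_samples=n, n_features=d, n_informative=3, n_redundant=0,
        weights=[1 - minority, minority], random_state=i,
    )
    rows.append(meta_features(X, y))

# Standardize so that no single meta-feature dominates the distance metric,
# then group the datasets into four types with k-means.
meta = StandardScaler().fit_transform(np.array(rows))
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(meta)
dataset_type = kmeans.labels_  # one type label (0-3) per dataset
```

Given such type labels, the recommendation step would compare candidate classifiers and class imbalance resolution methods within each type and report the best-performing combination per type. A new dataset could then be assigned to a type with `kmeans.predict` after applying the same scaling.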
Keywords
Class Imbalance; Meta Learning; Dataset Type; Clustering Analysis; Data Characteristics