[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.13088/jiis.2022.28.3.023

Class Imbalance Resolution Method and Classification Algorithm Suggesting Based on Dataset Type Segmentation

Kim, Jeonghun (Graduate School of Business IT, Kookmin University)
Kwahk, Kee-Young (College of Business Administration/Graduate School of Business IT, Kookmin University)

Publication Information

Journal of Intelligence and Information Systems / v.28, no.3, 2022, pp. 23-43

Abstract

Interest in algorithm selection is growing as AI (artificial intelligence) is applied across industries. Algorithm selection depends largely on the experience of the data scientist; an inexperienced data scientist instead selects an algorithm through meta-learning based on dataset characteristics. Because that selection process is a black box, however, the basis on which existing algorithm recommendations are derived has not been visible. This study therefore uses k-means cluster analysis to classify datasets into types according to their characteristics, and explores suitable classification algorithms and class imbalance resolution methods for each type. Four dataset types were derived, and an appropriate class imbalance resolution method and classification algorithm are recommended for each type.
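
As a rough illustration of the clustering step the abstract describes, the sketch below groups datasets by their characteristics with a plain k-means (Lloyd's algorithm) written against the Python standard library. It is not the authors' implementation. The three meta-features (imbalance ratio, feature count, log instance count), their values, and the helper names `standardize` and `kmeans` are all assumptions made for the example; only the use of k-means and the choice of four clusters come from the abstract.

```python
import random
from math import dist
from statistics import mean, pstdev

# Hypothetical meta-features per dataset (illustrative values, not from the paper):
# (imbalance ratio = majority/minority size, number of features, log10 instance count)
meta_features = [
    (1.2, 10, 3.0), (1.5, 12, 3.2),      # near-balanced, few features
    (20.0, 8, 3.1), (25.0, 9, 2.9),      # highly imbalanced, few features
    (1.3, 200, 4.5), (1.1, 180, 4.8),    # near-balanced, many features
    (30.0, 150, 5.0), (28.0, 170, 5.2),  # highly imbalanced, many features
]


def standardize(rows):
    """Z-score each column so no single meta-feature dominates the distance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [tuple((v - m) / s for v, m, s in zip(r, mus, sds)) for r in rows]


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct data points
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster in place
        if new_centroids == centroids:  # converged: centroids stopped moving
            break
        centroids = new_centroids
    return centroids, labels


# k=4 mirrors the four dataset types reported in the abstract.
centroids, labels = kmeans(standardize(meta_features), k=4)
for row, label in zip(meta_features, labels):
    print(f"type {label}: {row}")
```

Once each dataset has a type label, per-type recommendations could be formed by comparing how candidate classifiers and imbalance-handling methods perform within each cluster. Because k-means depends on its initial centroids, a different `seed` may produce a different grouping.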

Keywords

Class Imbalance; Meta Learning; Dataset Type; Clustering Analysis; Data Characteristics

Citations & Related Records

Times Cited by KSCI: 1

References

1	Kim, J., Kim, M. Y., & Kwon, O. (2020). The Effect of Meta-Features of Multiclass Datasets on the Performance of Classification Algorithms. Journal of Intelligence and Information Systems, 26(1), 23-45.
2	Anwar, N., Jones, G., & Ganesh, S. (2014). Measurement of data complexity for classification problems with unbalanced data. Statistical Analysis and Data Mining: The ASA Data Science Journal, 7(3), 194-211.
3	Ho, T. K., & Basu, M. (2002). Complexity measures of supervised classification problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3), 289-300.
4	Zhang, X., Li, R., Zhang, B., Yang, Y., Guo, J., & Ji, X. (2019). An instance-based learning recommendation algorithm of imbalance handling methods. Applied Mathematics and Computation, 351, 204-218.
5	Amin, A., Anwar, S., Adnan, A., Nawaz, M., Howard, N., Qadir, J., ... & Hussain, A. (2016). Comparing oversampling techniques to handle the class imbalance problem: A customer churn prediction case study. IEEE Access, 4, 7940-7957.
6	Pimentel, B. A., & De Carvalho, A. C. (2019). A new data characterization for selecting clustering algorithms using meta-learning. Information Sciences, 477, 203-219.
7	Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243.
8	Weiss, G. M., & Provost, F. (2003). Learning when training data are costly: The effect of class distribution on tree induction. Journal of Artificial Intelligence Research, 19, 315-354.
9	Cano, J. R. (2013). Analysis of data complexity measures for classification. Expert Systems with Applications, 40(12), 4820-4831.
10	George, G., Haas, M. R., & Pentland, A. (2014). Big data and management. Academy of Management Journal, 57(2), 321-326.
11	Huang, Y. M., Hung, C. M., & Jiau, H. C. (2006). Evaluation of neural networks and data mining methods on a credit assessment task for class imbalance problem. Nonlinear Analysis: Real World Applications, 7(4), 720-747.
12	Khoshgoftaar, T. M., Van Hulse, J., & Napolitano, A. (2010). Comparing boosting and bagging techniques with noisy and imbalanced data. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 41(3), 552-568.
13	Kim, J., & Kwon, O. (2021). A model for rapid selection and COVID-19 prediction with dynamic and imbalanced data. Sustainability, 13(6), 3099.
14	Lee, S., & Shin, T. (2018). Development and application of prediction model of hyperlipidemia using SVM and meta-learning algorithm. Journal of Intelligence and Information Systems, 24(2), 111-124.
15	Lorena, A. C., Maciel, A. I., de Miranda, P. B., Costa, I. G., & Prudencio, R. B. (2018). Data complexity meta-features for regression problems. Machine Learning, 107(1), 209-246.
16	Merz, P. (2004). Advanced fitness landscape analysis and the performance of memetic algorithms. Evolutionary Computation, 12(3), 303-325.
17	Jo, T., & Japkowicz, N. (2004). Class imbalances versus small disjuncts. ACM SIGKDD Explorations Newsletter, 6(1), 40-49.
18	Khan, I., Zhang, X., Rehman, M., & Ali, R. (2020). A literature survey and empirical study of meta-learning for classifier selection. IEEE Access, 8, 10262-10281.
19	Kim, E., & Hong, T. (2015). Response Modeling for the Marketing Promotion with Weighted Case Based Reasoning Under Imbalanced Data Distribution. Journal of Intelligence and Information Systems, 21(1), 29-45.
20	Kotsiantis, S., & Kanellopoulos, D. (2006). Discretization techniques: A recent survey. GESTS International Transactions on Computer Science and Engineering, 32(1), 47-58.
21	Krawczyk, B. (2016). Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence, 5(4), 221-232.
22	Leyva, E., Gonzalez, A., & Perez, R. (2014). A set of complexity measures designed for applying meta-learning to instance selection. IEEE Transactions on Knowledge and Data Engineering, 27(2), 354-367.
23	Lu, W. Z., & Wang, D. (2008). Ground-level ozone prediction by support vector machine approach with a cost-sensitive classification scheme. Science of the Total Environment, 395(2-3), 109-116.
24	Matsumoto, A., Merlone, U., & Szidarovszky, F. (2012). Some notes on applying the Herfindahl-Hirschman Index. Applied Economics Letters, 19(2), 181-184.
25	Van der Walt, C. M., & Barnard, E. (2007). Data characteristics that determine classifier performance. SAIEE Africa Research Journal, 98(3), 87-93.
26	Munoz, M. A., Sun, Y., Kirley, M., & Halgamuge, S. K. (2015). Algorithm selection for black-box continuous optimization problems: A survey on methods and challenges. Information Sciences, 317, 224-245.
27	Park, G. U., & Jung, I. (2019). Comparison of resampling methods for dealing with imbalanced data in binary classification problem. The Korean Journal of Applied Statistics, 32(3), 349-374.
28	Qureshi, S. R., & Gupta, A. (2014, March). Towards efficient Big Data and data analytics: A review. In 2014 Conference on IT in Business, Industry and Government (CSIBIG) (pp. 1-6). IEEE.
29	Rossi, A. L. D., de Leon Ferreira, A. C. P., Soares, C., & De Souza, B. F. (2014). MetaStream: A meta-learning based method for periodic algorithm selection in time-changing data. Neurocomputing, 127, 52-64.
30	Sun, A., Lim, E. P., & Liu, Y. (2009). On strategies for imbalanced text classification using SVM: A comparative study. Decision Support Systems, 48(1), 191-201.
31	Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1), 67-82.
32	Pasupa, K., Vatathanavaro, S., & Tungjitnob, S. (2020). Convolutional neural networks based focal loss for class imbalance problem: a case study of canine red blood cells morphology classification. Journal of Ambient Intelligence and Humanized Computing, 1-17.
33	Lopez, V., Fernandez, A., Garcia, S., Palade, V., & Herrera, F. (2013). An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Information Sciences, 250, 113-141.
34	Pfahringer, B., Bensusan, H., & Giraud-Carrier, C. G. (2000, June). Meta-Learning by Landmarking Various Learning Algorithms. In ICML (pp. 743-750).
35	Blagus, R., & Lusa, L. (2013). Improved shrunken centroid classifiers for high-dimensional class-imbalanced data. BMC Bioinformatics, 14(1), 1-13.
36	Dogan, N., & Tanrikulu, Z. (2013). A comparative analysis of classification algorithms in data mining for accuracy, speed and robustness. Information Technology and Management, 14(2), 105-124.
37	Feng, S., Keung, J., Yu, X., Xiao, Y., Bennin, K. E., Kabir, M. A., & Zhang, M. (2021). COSTE: Complexity-based OverSampling TEchnique to alleviate the class imbalance problem in software defect prediction. Information and Software Technology, 129, 106432.
38	Ho, T. K. (2002). A data complexity analysis of comparative advantages of decision forest constructors. Pattern Analysis & Applications, 5(2), 102-112.
39	Pascual-Triana, J. D., Charte, D., Andres Arroyo, M., Fernandez, A., & Herrera, F. (2021). Revisiting data complexity metrics based on morphology for overlap and imbalance: snapshot, new overlap number of balls metrics and singular problems prospect. Knowledge and Information Systems, 63(7), 1961-1989.
