http://dx.doi.org/10.5351/KJAS.2021.34.5.711

Selecting the optimal threshold based on impurity index in imbalanced classification  

Jang, Shuin (Department of Statistics, Sookmyung Women's University)
Yeo, In-Kwon (Department of Statistics, Sookmyung Women's University)
Publication Information
The Korean Journal of Applied Statistics / v.34, no.5, 2021, pp. 711-721
Abstract
In this paper, we propose a method for adjusting classification thresholds using impurity indices when the data are imbalanced. Suppose that, for imbalanced binomial data, the minority category is Positive and the majority category is Negative. When categories are assigned using the commonly used threshold of 0.5, specificity tends to be high in imbalanced data while sensitivity is relatively low. Increasing sensitivity matters when correctly classifying objects in the minority category is relatively important. We explore how to increase sensitivity by adjusting the threshold. Existing studies have adjusted thresholds based on measures such as G-Mean and F1-score; in this paper, we propose selecting the optimal threshold using the chi-square statistic of CHAID, the Gini index of CART, and the entropy of C4.5. We also describe how to obtain a single value when multiple optimal thresholds are found. An empirical analysis uses classification performance metrics to show the improvement over results based on the 0.5 threshold.
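The idea of choosing a threshold by an impurity index can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each candidate threshold splits the observations into a predicted-Positive group (score at or above the threshold) and a predicted-Negative group, and that the preferred threshold minimizes the size-weighted impurity of the true labels in the two groups, as in a CART or C4.5 split. The function names, the toy data, and the choice of observed scores as candidate thresholds are all assumptions made for illustration. The chi-square criterion of CHAID is omitted, and tied thresholds are simply returned together because the abstract does not state the tie-breaking rule.

```python
import math


def gini(pos, neg):
    """Gini impurity of a group with `pos` positives and `neg` negatives."""
    n = pos + neg
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)


def entropy(pos, neg):
    """Shannon entropy (base 2) of a group with `pos` positives and `neg` negatives."""
    n = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count > 0:
            q = count / n
            h -= q * math.log2(q)
    return h


def split_impurity(y_true, scores, threshold, impurity=gini):
    """Size-weighted impurity of the two groups induced by `threshold`."""
    hi_pos = hi_neg = lo_pos = lo_neg = 0
    for y, s in zip(y_true, scores):
        if s >= threshold:          # predicted Positive
            hi_pos += (y == 1)
            hi_neg += (y == 0)
        else:                       # predicted Negative
            lo_pos += (y == 1)
            lo_neg += (y == 0)
    n = len(y_true)
    return ((hi_pos + hi_neg) / n) * impurity(hi_pos, hi_neg) \
        + ((lo_pos + lo_neg) / n) * impurity(lo_pos, lo_neg)


def best_thresholds(y_true, scores, impurity=gini):
    """Return every candidate threshold that attains the minimum impurity.

    Candidates are the distinct observed scores (an assumption of this sketch).
    """
    candidates = sorted(set(scores))
    values = [(split_impurity(y_true, scores, t, impurity), t) for t in candidates]
    best = min(v for v, _ in values)
    return [t for v, t in values if math.isclose(v, best, abs_tol=1e-12)]


def sensitivity(y_true, scores, threshold):
    """Share of true positives whose score reaches the threshold."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    return sum(s >= threshold for s in positives) / len(positives)


# Hypothetical imbalanced toy data: 8 Negative (0) and 2 Positive (1) cases.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.12, 0.15, 0.20, 0.22, 0.25, 0.55, 0.35, 0.40]

print(best_thresholds(y_true, scores, impurity=gini))
print(best_thresholds(y_true, scores, impurity=entropy))
print(sensitivity(y_true, scores, 0.5), sensitivity(y_true, scores, 0.35))
```

On this toy data both positives score below 0.5, so the default threshold gives a sensitivity of zero, while the impurity-based choice moves the threshold low enough to capture them at the cost of one false positive.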
Keywords
imbalanced data; binomial classification; threshold moving; impurity index