[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.5351/KJAS.2021.34.5.711

Selecting the optimal threshold based on impurity index in imbalanced classification

Jang, Shuin (Department of Statistics, Sookmyung Women's University)
Yeo, In-Kwon (Department of Statistics, Sookmyung Women's University)

Publication Information

The Korean Journal of Applied Statistics / v.34, no.5, 2021, pp. 711-721

Abstract

In this paper, we propose a method for adjusting classification thresholds using impurity indices in the analysis of imbalanced data. Suppose that, for imbalanced binomial data, the minority category is Positive and the majority category is Negative. When categories are assigned using the commonly used threshold of 0.5, specificity tends to be high on imbalanced data while sensitivity is relatively low. Increasing sensitivity matters when correctly classifying objects in the minority category is relatively important. We explore how to increase sensitivity by adjusting the threshold. Existing studies have adjusted thresholds based on measures such as G-Mean and F1-score; in this paper, we propose a method that selects the optimal threshold using the chi-square statistic of CHAID, the Gini index of CART, and the entropy of C4.5. We also describe how to obtain a single value when multiple optimal thresholds are obtained. An empirical analysis uses classification performance metrics to show the improvements over results based on the 0.5 threshold.
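The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not state: each candidate threshold is treated like a tree split on the estimated probability of Positive, the threshold is scored by the size-weighted Gini index or entropy of the two resulting groups (lower is better) or by the Pearson chi-square statistic of the 2x2 table of true versus predicted labels (higher is better), and candidates come from a fixed grid. The function names, the grid, and the toy data are all hypothetical. The abstract does not say how the paper resolves ties, so the sketch simply returns every tied threshold.

```python
import numpy as np

def gini(labels):
    """Gini index of a 0/1 label array (0 for an empty group)."""
    if labels.size == 0:
        return 0.0
    p = labels.mean()
    return 2.0 * p * (1.0 - p)

def entropy(labels):
    """Entropy in bits of a 0/1 label array (0 for an empty or pure group)."""
    if labels.size == 0:
        return 0.0
    p = labels.mean()
    if p == 0.0 or p == 1.0:
        return 0.0
    return float(-(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p)))

def split_impurity(y_true, scores, threshold, impurity):
    """Size-weighted impurity of the groups predicted Negative and Positive."""
    pred_pos = scores >= threshold
    neg, pos = y_true[~pred_pos], y_true[pred_pos]
    return (neg.size * impurity(neg) + pos.size * impurity(pos)) / y_true.size

def chi_square(y_true, scores, threshold):
    """Pearson chi-square statistic of the 2x2 table: true label x predicted label."""
    pred = (scores >= threshold).astype(int)
    table = np.array([[np.sum((y_true == t) & (pred == p)) for p in (0, 1)]
                      for t in (0, 1)], dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    mask = expected > 0  # skip cells of an empty predicted class
    return float((((table - expected) ** 2)[mask] / expected[mask]).sum())

def select_thresholds(y_true, scores, criterion="gini", grid=None):
    """Return every grid threshold that is optimal under the chosen criterion."""
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)  # assumed candidate grid
    if criterion == "chi2":
        # Negate so that "smaller is better" holds for all three criteria.
        values = np.array([-chi_square(y_true, scores, t) for t in grid])
    else:
        fn = gini if criterion == "gini" else entropy
        values = np.array([split_impurity(y_true, scores, t, fn) for t in grid])
    return grid[np.isclose(values, values.min())]

def sensitivity(y_true, scores, threshold):
    """Share of true Positives that are predicted Positive."""
    pred_pos = scores >= threshold
    return float(np.sum(pred_pos & (y_true == 1)) / np.sum(y_true == 1))

# Toy imbalanced data: 8 Negatives, 2 Positives, all scores below 0.5.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
scores = np.array([0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.25, 0.30, 0.35, 0.45])

sens_default = sensitivity(y_true, scores, 0.5)
ties = {c: select_thresholds(y_true, scores, c) for c in ("chi2", "gini", "entropy")}
for name, thresholds in ties.items():
    # Placeholder tie-break (median of the tied values); not the paper's rule.
    chosen = float(np.median(thresholds))
    print(name, thresholds, sensitivity(y_true, scores, chosen))
```

On this toy data the 0.5 threshold predicts every object as Negative, so sensitivity is zero, while each criterion returns several tied thresholds that separate the two categories. That tie is the situation the abstract refers to when it mentions obtaining a single value from multiple optimal thresholds.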

Keywords

imbalanced data; binomial classification; threshold moving; impurity index

Citations & Related Records

Times Cited By KSCI: 1

References

1	Akosa J (2017). Predictive accuracy: a misleading performance measure for highly imbalanced data, SAS Papers, 942, 1-12.
2	Breiman L, Friedman JH, Olshen RA, and Stone CJ (1984). Classification and Regression Trees, Chapman & Hall, New York.
3	Collell G, Prelec D, and Patil KR (2018). A simple plug-in bagging ensemble based on threshold-moving for classifying binary and multiclass imbalanced data, Neurocomputing, 275, 330-340.
4	Espindola RP and Ebecken N (2005). On extending F-measure and G-Mean metrics to multi-class problems, WIT Transactions on Information and Communication Technologies, 35, 25-34.
5	Kim D, Kang S, and Song J (2015). Classification analysis for unbalanced data, The Korean Journal of Applied Statistics, 28, 495-509.
6	Quinlan JR (1986). Induction of decision trees, Machine Learning, 1, 81-106.
7	Quinlan JR (1993). C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers.
8	Chawla NV, Lazarevic A, Hall LO, and Bowyer KW (2003). SMOTEBoost: improving prediction of the minority class in boosting. In 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, 107-119.
9	Woods K, Doss C, Bowyer K, Solka J, Priebe C, and Kegelmeyer P (1993). Comparative evaluation of pattern recognition techniques for detection of microcalcifications in mammography, International Journal of Pattern Recognition and Artificial Intelligence, 7, 1417-1436.
10	Yu H, Mu C, Sun C, Yang W, Yang X, and Zuo X (2015). Support vector machine-based optimized decision threshold adjustment strategy for classifying imbalanced data, Knowledge-Based Systems, 76, 67-78.
11	Kass GV (1980). An exploratory technique for investigating large quantities of categorical data, Applied Statistics, 29, 119-127.
12	Longadge R, Dongre SS, and Malik L (2013). Class imbalance problem in data mining: review, International Journal of Computer Science and Network, 2.
13	Voigt T, Fried R, Backes M, and Rhode W (2014). Threshold optimization for classification in imbalanced data in a problem of gamma-ray astronomy, Advances in Data Analysis and Classification, 8, 195-216.
14	Chawla NV, Bowyer KW, Hall LO, and Kegelmeyer WP (2002). SMOTE: synthetic minority over-sampling technique, Journal of Artificial Intelligence Research, 16, 321-357.
15	Zou Q, Xie S, Lin Z, Wu M, and Ju Y (2016). Finding the best classification threshold in imbalanced classification, Big Data Research, 5, 2-8.
16	Kim HY and Lee W (2017). On sampling algorithms for imbalanced binary data: performance comparison and some caveats, The Korean Journal of Applied Statistics, 30, 681-690.
17	Blake C and Merz C (1998). UCI Repository of Machine Learning Databases, Department of Information and Computer Science, University of California, Irvine.
