http://dx.doi.org/10.5351/KJAS.2021.34.4.599

Discriminant analysis for unbalanced data using HDBSCAN  

Lee, Bo-Hui (Department of Advertising and Public Relations, Silla University)
Kim, Tae-Heon (Department of Statistics, Pusan National University)
Choi, Yong-Seok (Department of Statistics, Pusan National University)
Publication Information
The Korean Journal of Applied Statistics / v.34, no.4, 2021, pp. 599-609
Abstract
Data in which the number of objects differs greatly between categories are called unbalanced data. In discriminant analysis of unbalanced data, classifying objects in minority categories correctly matters more than classifying objects in majority categories well. However, objects in minority categories are often misclassified into majority categories. In this study, we propose a method that combines hierarchical DBSCAN (HDBSCAN) with the synthetic minority oversampling technique (SMOTE) to address this problem. The method first uses HDBSCAN to remove noise from both the minority and majority categories, and then applies SMOTE to generate new data. Area under the ROC curve (AUC) and F1 scores were used to compare its performance with existing methods. In most cases, the method combining HDBSCAN and SMOTE achieved high values on these performance indices, and it was found to perform well in classifying unbalanced data.
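The two-stage procedure described in the abstract (noise removal, then oversampling) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy data and the `cluster_labels` vector are hypothetical, the labels stand in for the output of an HDBSCAN run (common implementations mark noise points with the label -1), and the helper names `drop_noise`, `smote`, and `f1_score` are introduced here for illustration only. The abstract does not give the parameter settings used in the paper, so `k` and the number of synthetic points are arbitrary choices.

```python
import random

def drop_noise(X, y, cluster_labels):
    """Keep only points that the clustering step did not flag as noise (-1)."""
    kept = [(x, t) for x, t, c in zip(X, y, cluster_labels) if c != -1]
    return [x for x, _ in kept], [t for _, t in kept]

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    point and one of its k nearest minority neighbours (basic SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

def f1_score(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN) for the chosen positive (minority) class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Hypothetical toy data: class 0 is the majority, class 1 the minority.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.1, 0.4], [8.0, 8.0],
     [2.0, 2.0], [2.1, 1.9], [1.9, 2.2], [-7.0, 9.0]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
# Assumed output of an HDBSCAN run; -1 marks the two isolated points as noise.
cluster_labels = [0, 0, 0, 0, 0, -1, 1, 1, 1, -1]

# Step 1: remove noise from both the majority and minority categories.
X_clean, y_clean = drop_noise(X, y, cluster_labels)

# Step 2: oversample the cleaned minority category until the classes balance.
minority = [x for x, t in zip(X_clean, y_clean) if t == 1]
n_new = y_clean.count(0) - y_clean.count(1)
synthetic = smote(minority, n_new, k=2)
X_bal = X_clean + synthetic
y_bal = y_clean + [1] * len(synthetic)
```

Removing noise first means SMOTE never interpolates toward an isolated minority point, so the synthetic objects stay inside the region occupied by the cleaned minority category. A classifier would then be fit on `X_bal` and `y_bal` and scored on held-out data, for example with `f1_score` above or with AUC.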
Keywords
unbalanced data; discriminant analysis; HDBSCAN; SMOTE