• 제목/요약/키워드: Imbalanced Dataset

검색결과 49건 처리시간 0.019초

Default Prediction for Real Estate Companies with Imbalanced Dataset

  • Dong, Yuan-Xiang;Xiao, Zhi;Xiao, Xue
    • Journal of Information Processing Systems
    • /
    • 제10권2호
    • /
    • pp.314-333
    • /
    • 2014
  • When analyzing default predictions in real estate companies, the number of non-defaulted cases always greatly exceeds the defaulted ones, which creates the two-class imbalance problem. This lowers the ability of prediction models to distinguish the default sample. In order to avoid this sample selection bias and to improve the prediction model, this paper applies a minority sample generation approach to create new minority samples. The logistic regression, support vector machine (SVM) classification, and neural network (NN) classification use an imbalanced dataset. They were used as benchmarks with a single prediction model that used a balanced dataset corrected by the minority samples generation approach. Instead of using prediction-oriented tests and the overall accuracy, the true positive rate (TPR), the true negative rate (TNR), G-mean, and F-score are used to measure the performance of default prediction models for imbalanced dataset. In this paper, we describe an empirical experiment that used a sampling of 14 default and 315 non-default listed real estate companies in China and report that most results using single prediction models with a balanced dataset generated better results than an imbalanced dataset.

Classification for Imbalanced Breast Cancer Dataset Using Resampling Methods

  • Hana Babiker, Nassar
    • International Journal of Computer Science & Network Security
    • /
    • 제23권1호
    • /
    • pp.89-95
    • /
    • 2023
  • Analyzing breast cancer patient files is becoming an exciting area of medical information analysis, especially with the increasing number of patient files. In this paper, breast cancer data is collected from Khartoum state hospital, and the dataset is classified into recurrence and no recurrence. The data is imbalanced, meaning that one of the two classes have more sample than the other. Many pre-processing techniques are applied to classify this imbalanced data, resampling, attribute selection, and handling missing values, and then different classifiers models are built. In the first experiment, five classifiers (ANN, REP TREE, SVM, and J48) are used, and in the second experiment, meta-learning algorithms (Bagging, Boosting, and Random subspace). Finally, the ensemble model is used. The best result was obtained from the ensemble model (Boosting with J48) with the highest accuracy 95.2797% among all the algorithms, followed by Bagging with J48(90.559%) and random subspace with J48(84.2657%). The breast cancer imbalanced dataset was classified into recurrence, and no recurrence with different classified algorithms and the best result was obtained from the ensemble model.

가우시안 기반 Hyper-Rectangle 생성을 이용한 효율적 단일 분류기 (An Efficient One Class Classifier Using Gaussian-based Hyper-Rectangle Generation)

  • 김도균;최진영;고정한
    • 산업경영시스템학회지
    • /
    • 제41권2호
    • /
    • pp.56-64
    • /
    • 2018
  • In recent years, imbalanced data is one of the most important and frequent issue for quality control in industrial field. As an example, defect rate has been drastically reduced thanks to highly developed technology and quality management, so that only few defective data can be obtained from production process. Therefore, quality classification should be performed under the condition that one class (defective dataset) is even smaller than the other class (good dataset). However, traditional multi-class classification methods are not appropriate to deal with such an imbalanced dataset, since they classify data from the difference between one class and the others that can hardly be found in imbalanced datasets. Thus, one-class classification that thoroughly learns patterns of target class is more suitable for imbalanced dataset since it only focuses on data in a target class. So far, several one-class classification methods such as one-class support vector machine, neural network and decision tree there have been suggested. One-class support vector machine and neural network can guarantee good classification rate, and decision tree can provide a set of rules that can be clearly interpreted. However, the classifiers obtained from the former two methods consist of complex mathematical functions and cannot be easily understood by users. In case of decision tree, the criterion for rule generation is ambiguous. Therefore, as an alternative, a new one-class classifier using hyper-rectangles was proposed, which performs precise classification compared to other methods and generates rules clearly understood by users as well. In this paper, we suggest an approach for improving the limitations of those previous one-class classification algorithms. Specifically, the suggested approach produces more improved one-class classifier using hyper-rectangles generated by using Gaussian function. The performance of the suggested algorithm is verified by a numerical experiment, which uses several datasets in UCI machine learning repository.

Imbalanced SVM-Based Anomaly Detection Algorithm for Imbalanced Training Datasets

  • Wang, GuiPing;Yang, JianXi;Li, Ren
    • ETRI Journal
    • /
    • 제39권5호
    • /
    • pp.621-631
    • /
    • 2017
  • Abnormal samples are usually difficult to obtain in production systems, resulting in imbalanced training sample sets. Namely, the number of positive samples is far less than the number of negative samples. Traditional Support Vector Machine (SVM)-based anomaly detection algorithms perform poorly for highly imbalanced datasets: the learned classification hyperplane skews toward the positive samples, resulting in a high false-negative rate. This article proposes a new imbalanced SVM (termed ImSVM)-based anomaly detection algorithm, which assigns a different weight for each positive support vector in the decision function. ImSVM adjusts the learned classification hyperplane to make the decision function achieve a maximum GMean measure value on the dataset. The above problem is converted into an unconstrained optimization problem to search the optimal weight vector. Experiments are carried out on both Cloud datasets and Knowledge Discovery and Data Mining datasets to evaluate ImSVM. Highly imbalanced training sample sets are constructed. The experimental results show that ImSVM outperforms over-sampling techniques and several existing imbalanced SVM-based techniques.

Heterogeneous Ensemble of Classifiers from Under-Sampled and Over-Sampled Data for Imbalanced Data

  • Kang, Dae-Ki;Han, Min-gyu
    • International journal of advanced smart convergence
    • /
    • 제8권1호
    • /
    • pp.75-81
    • /
    • 2019
  • Data imbalance problem is common and causes serious problem in machine learning process. Sampling is one of the effective methods for solving data imbalance problem. Over-sampling increases the number of instances, so when over-sampling is applied in imbalanced data, it is applied to minority instances. Under-sampling reduces instances, which usually is performed on majority data. We apply under-sampling and over-sampling to imbalanced data and generate sampled data sets. From the generated data sets from sampling and original data set, we construct a heterogeneous ensemble of classifiers. We apply five different algorithms to the heterogeneous ensemble. Experimental results on an intrusion detection dataset as an imbalanced datasets show that our approach shows effective results.

Re-SSS: Rebalancing Imbalanced Data Using Safe Sample Screening

  • Shi, Hongbo;Chen, Xin;Guo, Min
    • Journal of Information Processing Systems
    • /
    • 제17권1호
    • /
    • pp.89-106
    • /
    • 2021
  • Different samples can have different effects on learning support vector machine (SVM) classifiers. To rebalance an imbalanced dataset, it is reasonable to reduce non-informative samples and add informative samples for learning classifiers. Safe sample screening can identify a part of non-informative samples and retain informative samples. This study developed a resampling algorithm for Rebalancing imbalanced data using Safe Sample Screening (Re-SSS), which is composed of selecting Informative Samples (Re-SSS-IS) and rebalancing via a Weighted SMOTE (Re-SSS-WSMOTE). The Re-SSS-IS selects informative samples from the majority class, and determines a suitable regularization parameter for SVM, while the Re-SSS-WSMOTE generates informative minority samples. Both Re-SSS-IS and Re-SSS-WSMOTE are based on safe sampling screening. The experimental results show that Re-SSS can effectively improve the classification performance of imbalanced classification problems.

불균형 데이터 분류를 위한 딥러닝 기반 오버샘플링 기법 (A Deep Learning Based Over-Sampling Scheme for Imbalanced Data Classification)

  • 손민재;정승원;황인준
    • 정보처리학회논문지:소프트웨어 및 데이터공학
    • /
    • 제8권7호
    • /
    • pp.311-316
    • /
    • 2019
  • 분류 문제는 주어진 입력 데이터에 대해 해당 데이터의 클래스를 예측하는 문제로, 자주 쓰이는 방법 중의 하나는 주어진 데이터셋을 사용하여 기계학습 알고리즘을 학습시키는 것이다. 이런 경우 분류하고자 하는 클래스에 따른 데이터의 분포가 균일한 데이터셋이 이상적이지만, 불균형한 분포를 가지고 경우 제대로 분류하지 못하는 문제가 발생한다. 이러한 문제를 해결하기 위해 본 논문에서는 Conditional Generative Adversarial Networks(CGAN)을 활용하여 데이터 수의 균형을 맞추는 오버샘플링 기법을 제안한다. CGAN은 Generative Adversarial Networks(GAN)에서 파생된 생성 모델로, 데이터의 특징을 학습하여 실제 데이터와 유사한 데이터를 생성할 수 있다. 따라서 CGAN이 데이터 수가 적은 클래스의 데이터를 학습하고 생성함으로써 불균형한 클래스 비율을 맞추어 줄 수 있으며, 그에 따라 분류 성능을 높일 수 있다. 실제 수집된 데이터를 이용한 실험을 통해 CGAN을 활용한 오버샘플링 기법이 효과가 있음을 보이고 기존 오버샘플링 기법들과 비교하여 기존 기법들보다 우수함을 입증하였다.

클래스 불균형 데이터에 적합한 기계 학습 기반 침입 탐지 시스템 (Machine Learning Based Intrusion Detection Systems for Class Imbalanced Datasets)

  • 정윤경;박기남;김현주;김종현;현상원
    • 정보보호학회논문지
    • /
    • 제27권6호
    • /
    • pp.1385-1395
    • /
    • 2017
  • 본 논문에서는 정상과 이상 트래픽이 불균형적으로 발생하는 상황에서 기계 학습 기반의 효과적인 침입 탐지 시스템에 관한 연구 결과를 소개한다. 훈련 데이터의 패턴을 학습하여 정상/이상 패킷을 탐지하는 기계 학습 기반의 IDS에서는 훈련 데이터의 클래스 불균형 정도에 따라 탐지 성능이 현저히 차이가 날 수 있으나, IDS 개발 시 이러한 문제에 대한 고려는 부족한 실정이다. 클래스 불균형 데이터가 발생하는 환경에서도 우수한 탐지 성능을 제공하는 기계 학습 알고리즘을 선정하기 위하여, 본 논문에서는 Kyoto 2006+ 데이터셋을 이용하여 정상 대 침입 클래스 비율이 서로 다른 클래스 불균형 훈련 데이터를 구축하고 다양한 기계 학습 알고리즘의 인식 성능을 분석하였다. 실험 결과, 대부분의 지도 학습 알고리즘이 좋은 성능을 보인 가운데, Random Forest 알고리즘이 다양한 실험 환경에서 최고의 성능을 보였다.

불균형데이터의 비용민감학습을 통한 국방분야 이미지 분류 성능 향상에 관한 연구 (A Study on the Improvement of Image Classification Performance in the Defense Field through Cost-Sensitive Learning of Imbalanced Data)

  • 정미애;마정목
    • 한국군사과학기술학회지
    • /
    • 제24권3호
    • /
    • pp.281-292
    • /
    • 2021
  • With the development of deep learning technology, researchers and technicians keep attempting to apply deep learning in various industrial and academic fields, including the defense. Most of these attempts assume that the data are balanced. In reality, since lots of the data are imbalanced, the classifier is not properly built and the model's performance can be low. Therefore, this study proposes cost-sensitive learning as a solution to the imbalance data problem of image classification in the defense field. In the proposed model, cost-sensitive learning is a method of giving a high weight on the cost function of a minority class. The results of cost-sensitive based model shows the test F1-score is higher when cost-sensitive learning is applied than general learning's through 160 experiments using submarine/non-submarine dataset and warship/non-warship dataset. Furthermore, statistical tests are conducted and the results are shown significantly.

Similarity Measurement Between Titles and Abstracts Using Bijection Mapping and Phi-Correlation Coefficient

  • John N. Mlyahilu;Jong-Nam Kim
    • 융합신호처리학회논문지
    • /
    • 제23권3호
    • /
    • pp.143-149
    • /
    • 2022
  • This excerpt delineates a quantitative measure of relationship between a research title and its respective abstract extracted from different journal articles documented through a Korean Citation Index (KCI) database published through various journals. In this paper, we propose a machine learning-based similarity metric that does not assume normality on dataset, realizes the imbalanced dataset problem, and zero-variance problem that affects most of the rule-based algorithms. The advantage of using this algorithm is that, it eliminates the limitations experienced by Pearson correlation coefficient (r) and additionally, it solves imbalanced dataset problem. A total of 107 journal articles collected from the database were used to develop a corpus with authors, year of publication, title, and an abstract per each. Based on the experimental results, the proposed algorithm achieved high correlation coefficient values compared to others which are cosine similarity, euclidean, and pearson correlation coefficients by scoring a maximum correlation of 1, whereas others had obtained non-a-number value to some experiments. With these results, we found that an effective title must have high correlation coefficient with the respective abstract.