• Title/Summary/Keyword: Class Imbalance Problem

Search Result 52, Processing Time 0.03 seconds

유전자 알고리즘을 활용한 데이터 불균형 해소 기법의 조합적 활용

  • Jang, Yeong-Sik;Kim, Jong-U;Heo, Jun
    • Proceedings of the Korea Inteligent Information System Society Conference
    • /
    • 2007.05a
    • /
    • pp.309-320
    • /
    • 2007
  • The data imbalance problem which can be uncounted in data mining classification problems typically means that there are more or less instances in a class than those in other classes. It causes low prediction accuracy of the minority class because classifiers tend to assign instances to major classes and ignore the minor class to reduce overall misclassification rate. In order to solve the data imbalance problem, there has been proposed a number of techniques based on resampling with replacement, adjusting decision thresholds, and adjusting the cost of the different classes. In this paper, we study the feasibility of the combination usage of the techniques previously proposed to deal with the data imbalance problem, and suggest a combination method using genetic algorithm to find the optimal combination ratio of the techniques. To improve the prediction accuracy of a minority class, we determine the combination ratio based on the F-value of the minority class as the fitness function of genetic algorithm. To compare the performance with those of single techniques and the matrix-style combination of random percentage, we performed experiments using four public datasets which has been generally used to compare the performance of methods for the data imbalance problem. From the results of experiments, we can find the usefulness of the proposed method.

  • PDF

Learning Behavior Analysis of Bayesian Algorithm Under Class Imbalance Problems (클래스 불균형 문제에서 베이지안 알고리즘의 학습 행위 분석)

  • Hwang, Doo-Sung
    • Journal of the Institute of Electronics Engineers of Korea CI
    • /
    • v.45 no.6
    • /
    • pp.179-186
    • /
    • 2008
  • In this paper we analyse the effects of Bayesian algorithm in teaming class imbalance problems and compare the performance evaluation methods. The teaming performance of the Bayesian algorithm is evaluated over the class imbalance problems generated by priori data distribution, imbalance data rate and discrimination complexity. The experimental results are calculated by the AUC(Area Under the Curve) values of both ROC(Receiver Operator Characteristic) and PR(Precision-Recall) evaluation measures and compared according to imbalance data rate and discrimination complexity. In comparison and analysis, the Bayesian algorithm suffers from the imbalance rate, as the same result in the reported researches, and the data overlapping caused by discrimination complexity is the another factor that hampers the learning performance. As the discrimination complexity and class imbalance rate of the problems increase, the learning performance of the AUC of a PR measure is much more variant than that of the AUC of a ROC measure. But the performances of both measures are similar with the low discrimination complexity and class imbalance rate of the problems. The experimental results show 4hat the AUC of a PR measure is more proper in evaluating the learning of class imbalance problem and furthermore gets the benefit in designing the optimal learning model considering a misclassification cost.

The Optimization of Ensembles for Bankruptcy Prediction (기업부도 예측 앙상블 모형의 최적화)

  • Myoung Jong Kim;Woo Seob Yun
    • Information Systems Review
    • /
    • v.24 no.1
    • /
    • pp.39-57
    • /
    • 2022
  • This paper proposes the GMOPTBoost algorithm to improve the performance of the AdaBoost algorithm for bankruptcy prediction in which class imbalance problem is inherent. AdaBoost algorithm has the advantage of providing a robust learning opportunity for misclassified samples. However, there is a limitation in addressing class imbalance problem because the concept of arithmetic mean accuracy is embedded in AdaBoost algorithm. GMOPTBoost can optimize the geometric mean accuracy and effectively solve the category imbalance problem by applying Gaussian gradient descent. The samples are constructed according to the following two phases. First, five class imbalance datasets are constructed to verify the effect of the class imbalance problem on the performance of the prediction model and the performance improvement effect of GMOPTBoost. Second, class balanced data are constituted through data sampling techniques to verify the performance improvement effect of GMOPTBoost. The main results of 30 times of cross-validation analyzes are as follows. First, the class imbalance problem degrades the performance of ensembles. Second, GMOPTBoost contributes to performance improvements of AdaBoost ensembles trained on imbalanced datasets. Third, Data sampling techniques have a positive impact on performance improvement. Finally, GMOPTBoost contributes to significant performance improvement of AdaBoost ensembles trained on balanced datasets.

Combined Application of Data Imbalance Reduction Techniques Using Genetic Algorithm (유전자 알고리즘을 활용한 데이터 불균형 해소 기법의 조합적 활용)

  • Jang, Young-Sik;Kim, Jong-Woo;Hur, Joon
    • Journal of Intelligence and Information Systems
    • /
    • v.14 no.3
    • /
    • pp.133-154
    • /
    • 2008
  • The data imbalance problem which can be uncounted in data mining classification problems typically means that there are more or less instances in a class than those in other classes. In order to solve the data imbalance problem, there has been proposed a number of techniques based on re-sampling with replacement, adjusting decision thresholds, and adjusting the cost of the different classes. In this paper, we study the feasibility of the combination usage of the techniques previously proposed to deal with the data imbalance problem, and suggest a combination method using genetic algorithm to find the optimal combination ratio of the techniques. To improve the prediction accuracy of a minority class, we determine the combination ratio based on the F-value of the minority class as the fitness function of genetic algorithm. To compare the performance with those of single techniques and the matrix-style combination of random percentage, we performed experiments using four public datasets which has been generally used to compare the performance of methods for the data imbalance problem. From the results of experiments, we can find the usefulness of the proposed method.

  • PDF

The Detection of Online Manipulated Reviews Using Machine Learning and GPT-3 (기계학습과 GPT3를 시용한 조작된 리뷰의 탐지)

  • Chernyaeva, Olga;Hong, Taeho
    • Journal of Intelligence and Information Systems
    • /
    • v.28 no.4
    • /
    • pp.347-364
    • /
    • 2022
  • Fraudulent companies or sellers strategically manipulate reviews to influence customers' purchase decisions; therefore, the reliability of reviews has become crucial for customer decision-making. Since customers increasingly rely on online reviews to search for more detailed information about products or services before purchasing, many researchers focus on detecting manipulated reviews. However, the main problem in detecting manipulated reviews is the difficulties with obtaining data with manipulated reviews to utilize machine learning techniques with sufficient data. Also, the number of manipulated reviews is insufficient compared with the number of non-manipulated reviews, so the class imbalance problem occurs. The class with fewer examples is under-represented and can hamper a model's accuracy, so machine learning methods suffer from the class imbalance problem and solving the class imbalance problem is important to build an accurate model for detecting manipulated reviews. Thus, we propose an OpenAI-based reviews generation model to solve the manipulated reviews imbalance problem, thereby enhancing the accuracy of manipulated reviews detection. In this research, we applied the novel autoregressive language model - GPT-3 to generate reviews based on manipulated reviews. Moreover, we found that applying GPT-3 model for oversampling manipulated reviews can recover a satisfactory portion of performance losses and shows better performance in classification (logit, decision tree, neural networks) than traditional oversampling models such as random oversampling and SMOTE.

Centroid and Nearest Neighbor based Class Imbalance Reduction with Relevant Feature Selection using Ant Colony Optimization for Software Defect Prediction

  • B., Kiran Kumar;Gyani, Jayadev;Y., Bhavani;P., Ganesh Reddy;T, Nagasai Anjani Kumar
    • International Journal of Computer Science & Network Security
    • /
    • v.22 no.10
    • /
    • pp.1-10
    • /
    • 2022
  • Nowadays software defect prediction (SDP) is most active research going on in software engineering. Early detection of defects lowers the cost of the software and also improves reliability. Machine learning techniques are widely used to create SDP models based on programming measures. The majority of defect prediction models in the literature have problems with class imbalance and high dimensionality. In this paper, we proposed Centroid and Nearest Neighbor based Class Imbalance Reduction (CNNCIR) technique that considers dataset distribution characteristics to generate symmetry between defective and non-defective records in imbalanced datasets. The proposed approach is compared with SMOTE (Synthetic Minority Oversampling Technique). The high-dimensionality problem is addressed using Ant Colony Optimization (ACO) technique by choosing relevant features. We used nine different classifiers to analyze six open-source software defect datasets from the PROMISE repository and seven performance measures are used to evaluate them. The results of the proposed CNNCIR method with ACO based feature selection reveals that it outperforms SMOTE in the majority of cases.

Optimization of Uneven Margin SVM to Solve Class Imbalance in Bankruptcy Prediction (비대칭 마진 SVM 최적화 모델을 이용한 기업부실 예측모형의 범주 불균형 문제 해결)

  • Sung Yim Jo;Myoung Jong Kim
    • Information Systems Review
    • /
    • v.24 no.4
    • /
    • pp.23-40
    • /
    • 2022
  • Although Support Vector Machine(SVM) has been used in various fields such as bankruptcy prediction model, the hyperplane learned by SVM in class imbalance problem can be severely skewed toward minority class and has a negative impact on performance because the area of majority class is expanded while the area of minority class is invaded. This study proposed optimized uneven margin SVM(OPT-UMSVM) combining threshold moving or post scaling method with UMSVM to cope with the limitation of the traditional even margin SVM(EMSVM) in class imbalance problem. OPT-UMSVM readjusted the skewed hyperplane to the majority class and had better generation ability than EMSVM improving the sensitivity of minority class and calculating the optimized performance. To validate OPT-UMSVM, 10-fold cross validations were performed on five sub-datasets with different imbalance ratio values. Empirical results showed two main findings. First, UMSVM had a weak effect on improving the performance of EMSVM in balanced datasets, but it greatly outperformed EMSVM in severely imbalanced datasets. Second, compared to EMSVM and conventional UMSVM, OPT-UMSVM had better performance in both balanced and imbalanced datasets and showed a significant difference performance especially in severely imbalanced datasets.

A Method of Bank Telemarketing Customer Prediction based on Hybrid Sampling and Stacked Deep Networks (혼성 표본 추출과 적층 딥 네트워크에 기반한 은행 텔레마케팅 고객 예측 방법)

  • Lee, Hyunjin
    • Journal of Korea Society of Digital Industry and Information Management
    • /
    • v.15 no.3
    • /
    • pp.197-206
    • /
    • 2019
  • Telemarketing has been used in finance due to the reduction of offline channels. In order to select telemarketing target customers, various machine learning techniques have emerged to maximize the effect of minimum cost. However, there are problems that the class imbalance, which the number of marketing success customers is smaller than the number of failed customers, and the recall rate is lower than accuracy. In this paper, we propose a method that solve the imbalanced class problem and increase the recall rate to improve the efficiency. The hybrid sampling method is applied to balance the data in the class, and the stacked deep network is applied to improve the recall and precision as well as the accuracy. The proposed method is applied to actual bank telemarketing data. As a result of the comparison experiment, the accuracy, the recall, and the precision is improved higher than that of the conventional methods.

Learning T.P.O Inference Model of Fashion Outfit Using LDAM Loss in Class Imbalance (LDAM 손실 함수를 활용한 클래스 불균형 상황에서의 옷차림 T.P.O 추론 모델 학습)

  • Park, Jonghyuk
    • Journal of the Korea Convergence Society
    • /
    • v.12 no.3
    • /
    • pp.17-25
    • /
    • 2021
  • When a person wears clothing, it is important to configure an outfit appropriate to the intended occasion. Therefore, T.P.O(Time, Place, Occasion) of the outfit is considered in various fashion recommendation systems based on artificial intelligence. However, there are few studies that directly infer the T.P.O from outfit images, as the nature of the problem causes multi-label and class imbalance problems, which makes model training challenging. Therefore, in this study, we propose a model that can infer the T.P.O of outfit images by employing a label-distribution-aware margin(LDAM) loss function. Datasets for the model training and evaluation were collected from fashion shopping malls. As a result of measuring performance, it was confirmed that the proposed model showed balanced performance in all T.P.O classes compared to baselines.

Optimal Ratio of Data Oversampling Based on a Genetic Algorithm for Overcoming Data Imbalance (데이터 불균형 해소를 위한 유전알고리즘 기반 최적의 오버샘플링 비율)

  • Shin, Seung-Soo;Cho, Hwi-Yeon;Kim, Yong-Hyuk
    • Journal of the Korea Convergence Society
    • /
    • v.12 no.1
    • /
    • pp.49-55
    • /
    • 2021
  • Recently, with the development of database, it is possible to store a lot of data generated in finance, security, and networks. These data are being analyzed through classifiers based on machine learning. The main problem at this time is data imbalance. When we train imbalanced data, it may happen that classification accuracy is degraded due to over-fitting with majority class data. To overcome the problem of data imbalance, oversampling strategy that increases the quantity of data of minority class data is widely used. It requires to tuning process about suitable method and parameters for data distribution. To improve the process, In this study, we propose a strategy to explore and optimize oversampling combinations and ratio based on various methods such as synthetic minority oversampling technique and generative adversarial networks through genetic algorithms. After sampling credit card fraud detection which is a representative case of data imbalance, with the proposed strategy and single oversampling strategies, we compare the performance of trained classifiers with each data. As a result, a strategy that is optimized by exploring for ratio of each method with genetic algorithms was superior to previous strategies.