• Title/Summary/Keyword: class imbalance

Search Result 119, Processing Time 0.024 seconds

Ensemble Learning for Solving Data Imbalance in Bankruptcy Prediction (기업부실 예측 데이터의 불균형 문제 해결을 위한 앙상블 학습)

  • Kim, Myoung-Jong
    • Journal of Intelligence and Information Systems
    • /
    • v.15 no.3
    • /
    • pp.1-15
    • /
    • 2009
  • In a classification problem, data imbalance occurs when the number of instances in one class greatly outnumbers the number of instances in the other class. Such data sets often cause a default classifier to be built due to skewed boundary and thus the reduction in the classification accuracy of such a classifier. This paper proposes a Geometric Mean-based Boosting (GM-Boost) to resolve the problem of data imbalance. Since GM-Boost introduces the notion of geometric mean, it can perform learning process considering both majority and minority sides, and reinforce the learning on misclassified data. An empirical study with bankruptcy prediction on Korea companies shows that GM-Boost has the higher classification accuracy than previous methods including Under-sampling, Over-Sampling, and AdaBoost, used in imbalanced data and robust learning performance regardless of the degree of data imbalance.

  • PDF

Learning T.P.O Inference Model of Fashion Outfit Using LDAM Loss in Class Imbalance (LDAM 손실 함수를 활용한 클래스 불균형 상황에서의 옷차림 T.P.O 추론 모델 학습)

  • Park, Jonghyuk
    • Journal of the Korea Convergence Society
    • /
    • v.12 no.3
    • /
    • pp.17-25
    • /
    • 2021
  • When a person wears clothing, it is important to configure an outfit appropriate to the intended occasion. Therefore, T.P.O(Time, Place, Occasion) of the outfit is considered in various fashion recommendation systems based on artificial intelligence. However, there are few studies that directly infer the T.P.O from outfit images, as the nature of the problem causes multi-label and class imbalance problems, which makes model training challenging. Therefore, in this study, we propose a model that can infer the T.P.O of outfit images by employing a label-distribution-aware margin(LDAM) loss function. Datasets for the model training and evaluation were collected from fashion shopping malls. As a result of measuring performance, it was confirmed that the proposed model showed balanced performance in all T.P.O classes compared to baselines.

Image-to-Image Translation with GAN for Synthetic Data Augmentation in Plant Disease Datasets

  • Nazki, Haseeb;Lee, Jaehwan;Yoon, Sook;Park, Dong Sun
    • Smart Media Journal
    • /
    • v.8 no.2
    • /
    • pp.46-57
    • /
    • 2019
  • In recent research, deep learning-based methods have achieved state-of-the-art performance in various computer vision tasks. However, these methods are commonly supervised, and require huge amounts of annotated data to train. Acquisition of data demands an additional costly effort, particularly for the tasks where it becomes challenging to obtain large amounts of data considering the time constraints and the requirement of professional human diligence. In this paper, we present a data level synthetic sampling solution to learn from small and imbalanced data sets using Generative Adversarial Networks (GANs). The reason for using GANs are the challenges posed in various fields to manage with the small datasets and fluctuating amounts of samples per class. As a result, we present an approach that can improve learning with respect to data distributions, reducing the partiality introduced by class imbalance and hence shifting the classification decision boundary towards more accurate results. Our novel method is demonstrated on a small dataset of 2789 tomato plant disease images, highly corrupted with class imbalance in 9 disease categories. Moreover, we evaluate our results in terms of different metrics and compare the quality of these results for distinct classes.

AI Performance Based On Learning-Data Labeling Accuracy (인공지능 학습데이터 라벨링 정확도에 따른 인공지능 성능)

  • Ji-Hoon Lee;Jieun Shin
    • Journal of Industrial Convergence
    • /
    • v.22 no.1
    • /
    • pp.177-183
    • /
    • 2024
  • The study investigates the impact of data quality on the performance of artificial intelligence (AI). To this end, the impact of labeling error levels on the performance of artificial intelligence was compared and analyzed through simulation, taking into account the similarity of data features and the imbalance of class composition. As a result, data with high similarity between characteristic variables were found to be more sensitive to labeling accuracy than data with low similarity between characteristic variables. It was observed that artificial intelligence accuracy tended to decrease rapidly as class imbalance increased. This will serve as the fundamental data for evaluating the quality criteria and conducting related research on artificial intelligence learning data.

A Comparison of Ensemble Methods Combining Resampling Techniques for Class Imbalanced Data (데이터 전처리와 앙상블 기법을 통한 불균형 데이터의 분류모형 비교 연구)

  • Leea, Hee-Jae;Lee, Sungim
    • The Korean Journal of Applied Statistics
    • /
    • v.27 no.3
    • /
    • pp.357-371
    • /
    • 2014
  • There are many studies related to imbalanced data in which the class distribution is highly skewed. To address the problem of imbalanced data, previous studies deal with resampling techniques which correct the skewness of the class distribution in each sampled subset by using under-sampling, over-sampling or hybrid-sampling such as SMOTE. Ensemble methods have also alleviated the problem of class imbalanced data. In this paper, we compare around a dozen algorithms that combine the ensemble methods and resampling techniques based on simulated data sets generated by the Backbone model, which can handle the imbalance rate. The results on various real imbalanced data sets are also presented to compare the effectiveness of algorithms. As a result, we highly recommend the resampling technique combining ensemble methods for imbalanced data in which the proportion of the minority class is less than 10%. We also find that each ensemble method has a well-matched sampling technique. The algorithms which combine bagging or random forest ensembles with random undersampling tend to perform well; however, the boosting ensemble appears to perform better with over-sampling. All ensemble methods combined with SMOTE outperform in most situations.

The Imbalance Compensation in CMG ('제어모멘트자이로'의 질량불균형 보정)

  • Lee, Jong-Kuk;Song, Tae-Seong;Kang, Jeong-Min;Song, Deok-Ki;Kwon, Jun-Beom;Seo, Joong-Bo;Oh, Hwa-Suk;Cheon, Dong-Ik;Hong, Young-Gon;Lee, Jun-Yong
    • Journal of the Korean Society for Aeronautical & Space Sciences
    • /
    • v.48 no.11
    • /
    • pp.861-871
    • /
    • 2020
  • Raising the speed of the momentum wheel in the CMG increases the unintended force and torque caused by mass imbalance. This unintended force and torque should be minimized to get the better quality of satellite SAR image because they lead to the vibration of the output image. This paper shows the works on compensating the static imbalance and couple mass imbalance in the CMG wheel. First, the force and torque at the center of mass generated by the mass imbalance were predicted through M&S analysis. Second, the force and torque were estimated similarly through the M&S analysis when the measurement point was moved from the rotation center. Third, the measurement configuration for the force and torque by the mass imbalance was described. Fourth, the change of the force and torque by adding the specified mass to the momentum wheel was observed after comparing the measurements with the results of the M&S. And finally, the effect of the compensation was analyzed by comparing the force and torque before and after the correction while 24Nm class CMG was running in the standby mode.

Comparison of resampling methods for dealing with imbalanced data in binary classification problem (이분형 자료의 분류문제에서 불균형을 다루기 위한 표본재추출 방법 비교)

  • Park, Geun U;Jung, Inkyung
    • The Korean Journal of Applied Statistics
    • /
    • v.32 no.3
    • /
    • pp.349-374
    • /
    • 2019
  • A class imbalance problem arises when one class outnumbers the other class by a large proportion in binary data. Studies such as transforming the learning data have been conducted to solve this imbalance problem. In this study, we compared resampling methods among methods to deal with an imbalance in the classification problem. We sought to find a way to more effectively detect the minority class in the data. Through simulation, a total of 20 methods of over-sampling, under-sampling, and combined method of over- and under-sampling were compared. The logistic regression, support vector machine, and random forest models, which are commonly used in classification problems, were used as classifiers. The simulation results showed that the random under sampling (RUS) method had the highest sensitivity with an accuracy over 0.5. The next most sensitive method was an over-sampling adaptive synthetic sampling approach. This revealed that the RUS method was suitable for finding minority class values. The results of applying to some real data sets were similar to those of the simulation.

Application of Random Over Sampling Examples(ROSE) for an Effective Bankruptcy Prediction Model (효과적인 기업부도 예측모형을 위한 ROSE 표본추출기법의 적용)

  • Ahn, Cheolhwi;Ahn, Hyunchul
    • The Journal of the Korea Contents Association
    • /
    • v.18 no.8
    • /
    • pp.525-535
    • /
    • 2018
  • If the frequency of a particular class is excessively higher than the frequency of other classes in the classification problem, data imbalance problems occur, which make machine learning distorted. Corporate bankruptcy prediction often suffers from data imbalance problems since the ratio of insolvent companies is generally very low, whereas the ratio of solvent companies is very high. To mitigate these problems, it is required to apply a proper sampling technique. Until now, oversampling techniques which adjust the class distribution of a data set by sampling minor class with replacement have popularly been used. However, they are a risk of overfitting. Under this background, this study proposes ROSE(Random Over Sampling Examples) technique which is proposed by Menardi and Torelli in 2014 for the effective corporate bankruptcy prediction. The ROSE technique creates new learning samples by synthesizing the samples for learning, so it leads to better prediction accuracy of the classifiers while avoiding the risk of overfitting. Specifically, our study proposes to combine the ROSE method with SVM(support vector machine), which is known as the best binary classifier. We applied the proposed method to a real-world bankruptcy prediction case of a Korean major bank, and compared its performance with other sampling techniques. Experimental results showed that ROSE contributed to the improvement of the prediction accuracy of SVM in bankruptcy prediction compared to other techniques, with statistical significance. These results shed a light on the fact that ROSE can be a good alternative for resolving data imbalance problems of the prediction problems in social science area other than bankruptcy prediction.

Boosting the Performance of the Predictive Model on the Imbalanced Dataset Using SVM Based Bagging and Out-of-Distribution Detection (SVM 기반 Bagging과 OoD 탐색을 활용한 제조공정의 불균형 Dataset에 대한 예측모델의 성능향상)

  • Kim, Jong Hoon;Oh, Hayoung
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.11 no.11
    • /
    • pp.455-464
    • /
    • 2022
  • There are two unique characteristics of the datasets from a manufacturing process. They are the severe class imbalance and lots of Out-of-Distribution samples. Some good strategies such as the oversampling over the minority class, and the down-sampling over the majority class, are well known to handle the class imbalance. In addition, SMOTE has been chosen to address the issue recently. But, Out-of-Distribution samples have been studied just with neural networks. It seems to be hardly shown that Out-of-Distribution detection is applied to the predictive model using conventional machine learning algorithms such as SVM, Random Forest and KNN. It is known that conventional machine learning algorithms are much better than neural networks in prediction performance, because neural networks are vulnerable to over-fitting and requires much bigger dataset than conventional machine learning algorithms does. So, we suggests a new approach to utilize Out-of-Distribution detection based on SVM algorithm. In addition to that, bagging technique will be adopted to improve the precision of the model.

Crack Detection Method for Tunnel Lining Surfaces using Ternary Classifier

  • Han, Jeong Hoon;Kim, In Soo;Lee, Cheol Hee;Moon, Young Shik
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.14 no.9
    • /
    • pp.3797-3822
    • /
    • 2020
  • The inspection of cracks on the surface of tunnel linings is a common method of evaluate the condition of the tunnel. In particular, determining the thickness and shape of a crack is important because it indicates the external forces applied to the tunnel and the current condition of the concrete structure. Recently, several automatic crack detection methods have been proposed to identify cracks using captured tunnel lining images. These methods apply an image-segmentation mechanism with well-annotated datasets. However, generating the ground truths requires many resources, and the small proportion of cracks in the images cause a class-imbalance problem. A weakly annotated dataset is generated to reduce resource consumption and avoid the class-imbalance problem. However, the use of the dataset results in a large number of false positives and requires post-processing for accurate crack detection. To overcome these issues, we propose a crack detection method using a ternary classifier. The proposed method significantly reduces the false positive rate, and the performance (as measured by the F1 score) is improved by 0.33 compared to previous methods. These results demonstrate the effectiveness of the proposed method.