• Title/Summary/Keyword: Imbalance dataset

Search Result 57, Processing Time 0.023 seconds

A Study on Visual Emotion Classification using Balanced Data Augmentation (균형 잡힌 데이터 증강 기반 영상 감정 분류에 관한 연구)

  • Jeong, Chi Yoon;Kim, Mooseop
    • Journal of Korea Multimedia Society
    • /
    • v.24 no.7
    • /
    • pp.880-889
    • /
    • 2021
  • In everyday life, recognizing people's emotions from their frames is essential and is a popular research domain in the area of computer vision. Visual emotion has a severe class imbalance in which most of the data are distributed in specific categories. The existing methods do not consider class imbalance and used accuracy as the performance metric, which is not suitable for evaluating the performance of the imbalanced dataset. Therefore, we proposed a method for recognizing visual emotion using balanced data augmentation to address the class imbalance. The proposed method generates a balanced dataset by adopting the random over-sampling and image transformation methods. Also, the proposed method uses the Focal loss as a loss function, which can mitigate the class imbalance by down weighting the well-classified samples. EfficientNet, which is the state-of-the-art method for image classification is used to recognize visual emotion. We compare the performance of the proposed method with that of conventional methods by using a public dataset. The experimental results show that the proposed method increases the F1 score by 40% compared with the method without data augmentation, mitigating class imbalance without loss of classification accuracy.

Cross-Project Pooling of Defects for Handling Class Imbalance

  • Catherine, J.M.;Djodilatchoumy, S
    • International Journal of Computer Science & Network Security
    • /
    • v.22 no.10
    • /
    • pp.11-16
    • /
    • 2022
  • Applying predictive analytics to predict software defects has improved the overall quality and decreased maintenance costs. Many supervised and unsupervised learning algorithms have been used for defect prediction on publicly available datasets. Most of these datasets suffer from an imbalance in the output classes. We study the impact of class imbalance in the defect datasets on the efficiency of the defect prediction model and propose a CPP method for handling imbalances in the dataset. The performance of the methods is evaluated using measures like Matthew's Correlation Coefficient (MCC), Recall, and Accuracy measures. The proposed sampling technique shows significant improvement in the efficiency of the classifier in predicting defects.

A Study of a Method for Maintaining Accuracy Uniformity When Using Long-tailed Dataset (불균형 데이터세트 학습에서 정확도 균일화를 위한 학습 방법에 관한 연구)

  • Geun-pyo Park;XinYu Piao;Jong-Kook Kim
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2023.05a
    • /
    • pp.585-587
    • /
    • 2023
  • Long-tailed datasets have an imbalanced distribution because they consist of a different number of data samples for each class. However, there are problems of the performance degradation in tail-classes and class-accuracy imbalance for all classes. To address these problems, this paper suggests a learning method for training of long-tailed dataset. The proposed method uses and combines two methods; one is a resampling method to generate a uniform mini-batch to prevent the performance degradation in tail-classes, and the other is a reweighting method to address the accuracy imbalance problem. The purpose of our proposed method is to train the learning models to have uniform accuracy for each class in a long-tailed dataset.

Class Imbalance Resolution Method and Classification Algorithm Suggesting Based on Dataset Type Segmentation (데이터셋 유형 분류를 통한 클래스 불균형 해소 방법 및 분류 알고리즘 추천)

  • Kim, Jeonghun;Kwahk, Kee-Young
    • Journal of Intelligence and Information Systems
    • /
    • v.28 no.3
    • /
    • pp.23-43
    • /
    • 2022
  • In order to apply AI (Artificial Intelligence) in various industries, interest in algorithm selection is increasing. Algorithm selection is largely determined by the experience of a data scientist. However, in the case of an inexperienced data scientist, an algorithm is selected through meta-learning based on dataset characteristics. However, since the selection process is a black box, it was not possible to know on what basis the existing algorithm recommendation was derived. Accordingly, this study uses k-means cluster analysis to classify types according to data set characteristics, and to explore suitable classification algorithms and methods for resolving class imbalance. As a result of this study, four types were derived, and an appropriate class imbalance resolution method and classification algorithm were recommended according to the data set type.

KNN-Based Automatic Cropping for Improved Threat Object Recognition in X-Ray Security Images

  • Dumagpi, Joanna Kazzandra;Jung, Woo-Young;Jeong, Yong-Jin
    • Journal of IKEEE
    • /
    • v.23 no.4
    • /
    • pp.1134-1139
    • /
    • 2019
  • One of the most important applications of computer vision algorithms is the detection of threat objects in x-ray security images. However, in the practical setting, this task is complicated by two properties inherent to the dataset, namely, the problem of class imbalance and visual complexity. In our previous work, we resolved the class imbalance problem by using a GAN-based anomaly detection to balance out the bias induced by training a classification model on a non-practical dataset. In this paper, we propose a new method to alleviate the visual complexity problem by using a KNN-based automatic cropping algorithm to remove distracting and irrelevant information from the x-ray images. We use the cropped images as inputs to our current model. Empirical results show substantial improvement to our model, e.g. about 3% in the practical dataset, thus further outperforming previous approaches, which is very critical for security-based applications.

Geometric and Semantic Improvement for Unbiased Scene Graph Generation

  • Ruhui Zhang;Pengcheng Xu;Kang Kang;You Yang
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.17 no.10
    • /
    • pp.2643-2657
    • /
    • 2023
  • Scene graphs are structured representations that can clearly convey objects and the relationships between them, but are often heavily biased due to the highly skewed, long-tailed relational labeling in the dataset. Indeed, the visual world itself and its descriptions are biased. Therefore, Unbiased Scene Graph Generation (USGG) prefers to train models to eliminate long-tail effects as much as possible, rather than altering the dataset directly. To this end, we propose Geometric and Semantic Improvement (GSI) for USGG to mitigate this issue. First, to fully exploit the feature information in the images, geometric dimension and semantic dimension enhancement modules are designed. The geometric module is designed from the perspective that the position information between neighboring object pairs will affect each other, which can improve the recall rate of the overall relationship in the dataset. The semantic module further processes the embedded word vector, which can enhance the acquisition of semantic information. Then, to improve the recall rate of the tail data, the Class Balanced Seesaw Loss (CBSLoss) is designed for the tail data. The recall rate of the prediction is improved by penalizing the body or tail relations that are judged incorrectly in the dataset. The experimental findings demonstrate that the GSI method performs better than mainstream models in terms of the mean Recall@K (mR@K) metric in three tasks. The long-tailed imbalance in the Visual Genome 150 (VG150) dataset is addressed better using the GSI method than by most of the existing methods.

Validation of Semantic Segmentation Dataset for Autonomous Driving (승용자율주행을 위한 의미론적 분할 데이터셋 유효성 검증)

  • Gwak, Seoku;Na, Hoyong;Kim, Kyeong Su;Song, EunJi;Jeong, Seyoung;Lee, Kyewon;Jeong, Jihyun;Hwang, Sung-Ho
    • Journal of Drive and Control
    • /
    • v.19 no.4
    • /
    • pp.104-109
    • /
    • 2022
  • For autonomous driving research using AI, datasets collected from road environments play an important role. In other countries, various datasets such as CityScapes, A2D2, and BDD have already been released, but datasets suitable for the domestic road environment still need to be provided. This paper analyzed and verified the dataset reflecting the Korean driving environment. In order to verify the training dataset, the class imbalance was confirmed by comparing the number of pixels and instances of the dataset. A similar A2D2 dataset was trained with the same deep learning model, ConvNeXt, to compare and verify the constructed dataset. IoU was compared for the same class between two datasets with ConvNeXt and mIoU was compared. In this paper, it was confirmed that the collected dataset reflecting the driving environment of Korea is suitable for learning.

Classification for Imbalanced Breast Cancer Dataset Using Resampling Methods

  • Hana Babiker, Nassar
    • International Journal of Computer Science & Network Security
    • /
    • v.23 no.1
    • /
    • pp.89-95
    • /
    • 2023
  • Analyzing breast cancer patient files is becoming an exciting area of medical information analysis, especially with the increasing number of patient files. In this paper, breast cancer data is collected from Khartoum state hospital, and the dataset is classified into recurrence and no recurrence. The data is imbalanced, meaning that one of the two classes have more sample than the other. Many pre-processing techniques are applied to classify this imbalanced data, resampling, attribute selection, and handling missing values, and then different classifiers models are built. In the first experiment, five classifiers (ANN, REP TREE, SVM, and J48) are used, and in the second experiment, meta-learning algorithms (Bagging, Boosting, and Random subspace). Finally, the ensemble model is used. The best result was obtained from the ensemble model (Boosting with J48) with the highest accuracy 95.2797% among all the algorithms, followed by Bagging with J48(90.559%) and random subspace with J48(84.2657%). The breast cancer imbalanced dataset was classified into recurrence, and no recurrence with different classified algorithms and the best result was obtained from the ensemble model.

Default Prediction for Real Estate Companies with Imbalanced Dataset

  • Dong, Yuan-Xiang;Xiao, Zhi;Xiao, Xue
    • Journal of Information Processing Systems
    • /
    • v.10 no.2
    • /
    • pp.314-333
    • /
    • 2014
  • When analyzing default predictions in real estate companies, the number of non-defaulted cases always greatly exceeds the defaulted ones, which creates the two-class imbalance problem. This lowers the ability of prediction models to distinguish the default sample. In order to avoid this sample selection bias and to improve the prediction model, this paper applies a minority sample generation approach to create new minority samples. The logistic regression, support vector machine (SVM) classification, and neural network (NN) classification use an imbalanced dataset. They were used as benchmarks with a single prediction model that used a balanced dataset corrected by the minority samples generation approach. Instead of using prediction-oriented tests and the overall accuracy, the true positive rate (TPR), the true negative rate (TNR), G-mean, and F-score are used to measure the performance of default prediction models for imbalanced dataset. In this paper, we describe an empirical experiment that used a sampling of 14 default and 315 non-default listed real estate companies in China and report that most results using single prediction models with a balanced dataset generated better results than an imbalanced dataset.

Boosting the Performance of the Predictive Model on the Imbalanced Dataset Using SVM Based Bagging and Out-of-Distribution Detection (SVM 기반 Bagging과 OoD 탐색을 활용한 제조공정의 불균형 Dataset에 대한 예측모델의 성능향상)

  • Kim, Jong Hoon;Oh, Hayoung
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.11 no.11
    • /
    • pp.455-464
    • /
    • 2022
  • There are two unique characteristics of the datasets from a manufacturing process. They are the severe class imbalance and lots of Out-of-Distribution samples. Some good strategies such as the oversampling over the minority class, and the down-sampling over the majority class, are well known to handle the class imbalance. In addition, SMOTE has been chosen to address the issue recently. But, Out-of-Distribution samples have been studied just with neural networks. It seems to be hardly shown that Out-of-Distribution detection is applied to the predictive model using conventional machine learning algorithms such as SVM, Random Forest and KNN. It is known that conventional machine learning algorithms are much better than neural networks in prediction performance, because neural networks are vulnerable to over-fitting and requires much bigger dataset than conventional machine learning algorithms does. So, we suggests a new approach to utilize Out-of-Distribution detection based on SVM algorithm. In addition to that, bagging technique will be adopted to improve the precision of the model.