• Title/Abstract/Keywords: Imbalance dataset

58 search results (processing time: 0.03 s)

A Study on Visual Emotion Classification using Balanced Data Augmentation

  • 정치윤;김무섭
    • Journal of Korea Multimedia Society / Vol. 24, No. 7 / pp. 880-889 / 2021
  • In everyday life, recognizing people's emotions from their frames is essential and is a popular research domain in computer vision. Visual emotion datasets have a severe class imbalance in which most of the data are concentrated in a few categories. Existing methods do not consider class imbalance and use accuracy as the performance metric, which is not suitable for evaluating performance on an imbalanced dataset. Therefore, we propose a method for recognizing visual emotion that uses balanced data augmentation to address the class imbalance. The proposed method generates a balanced dataset by applying random over-sampling and image transformation. The proposed method also uses the Focal loss as its loss function, which can mitigate class imbalance by down-weighting well-classified samples. EfficientNet, a state-of-the-art method for image classification, is used to recognize visual emotion. We compare the performance of the proposed method with that of conventional methods on a public dataset. The experimental results show that the proposed method increases the F1 score by 40% compared with the method without data augmentation, mitigating class imbalance without loss of classification accuracy.
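The two ingredients named in this abstract, random over-sampling and the Focal loss, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the focusing parameter `gamma=2.0`, and the toy labels are assumptions, and the image transformation step and EfficientNet are omitted.

```python
import math
import random
from collections import Counter, defaultdict


def focal_loss(p_true, gamma=2.0):
    """Focal loss for one sample, given the predicted probability of its true class.

    gamma=0 reduces to ordinary cross-entropy; larger gamma shrinks the loss
    of well-classified samples (p_true close to 1). gamma=2.0 is an assumed default.
    """
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


def random_oversample(labels, seed=0):
    """Return sample indices in which every class appears as often as the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    target = max(len(ids) for ids in by_class.values())
    balanced = []
    for ids in by_class.values():
        balanced.extend(ids)                                     # keep every original sample
        balanced.extend(rng.choices(ids, k=target - len(ids)))   # duplicate at random
    return balanced


# Hypothetical toy labels: 6 "happy", 2 "sad", 1 "fear".
labels = ["happy"] * 6 + ["sad"] * 2 + ["fear"]
indices = random_oversample(labels)
print(Counter(labels[i] for i in indices))  # every class now has 6 samples
print(focal_loss(0.9), focal_loss(0.3))     # the confident sample contributes far less
```

In a real pipeline the duplicated minority images would typically be passed through random transformations so that the copies are not identical.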

Cross-Project Pooling of Defects for Handling Class Imbalance

  • Catherine, J.M.;Djodilatchoumy, S
    • International Journal of Computer Science & Network Security / Vol. 22, No. 10 / pp. 11-16 / 2022
  • Applying predictive analytics to predict software defects has improved overall quality and decreased maintenance costs. Many supervised and unsupervised learning algorithms have been used for defect prediction on publicly available datasets. Most of these datasets suffer from an imbalance in the output classes. We study the impact of class imbalance in the defect datasets on the efficiency of the defect prediction model and propose a Cross-Project Pooling (CPP) method for handling imbalances in the dataset. The performance of the methods is evaluated using the Matthews Correlation Coefficient (MCC), Recall, and Accuracy. The proposed sampling technique shows significant improvement in the efficiency of the classifier in predicting defects.
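To see why MCC and Recall are reported alongside Accuracy on imbalanced defect data, the sketch below computes all three from confusion-matrix counts. The counts (95 clean modules, 5 defective, a model that predicts "clean" for everything) are a hypothetical illustration, not data from the paper.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal is empty, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical case: every module is predicted "clean".
tp, fp, tn, fn = 0, 0, 95, 5
print(accuracy(tp, fp, tn, fn))  # 0.95, which looks good
print(recall(tp, fn))            # 0.0, no defect is found
print(mcc(tp, fp, tn, fn))       # 0.0, no better than chance
```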

A Study of a Method for Maintaining Accuracy Uniformity When Using Long-tailed Dataset

  • 박근표;박흠우;김종국
    • Korea Information Processing Society: Conference Proceedings / Korea Information Processing Society 2023 Spring Conference / pp. 585-587 / 2023
  • Long-tailed datasets have an imbalanced distribution because they contain a different number of data samples for each class. Models trained on them suffer from performance degradation in tail classes and from accuracy imbalance across classes. To address these problems, this paper suggests a learning method for training on long-tailed datasets. The proposed method combines two methods: a resampling method that generates a uniform mini-batch to prevent performance degradation in tail classes, and a reweighting method that addresses the accuracy imbalance problem. The purpose of the proposed method is to train models that have uniform accuracy for each class in a long-tailed dataset.
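The abstract does not give the exact resampling or reweighting rule, so the following is only a sketch of one common way to realize both ideas: a mini-batch with the same number of samples per class, and inverse-frequency class weights. The function names and the weight formula `n / (k * count)` are assumptions.

```python
import random
from collections import Counter, defaultdict


def uniform_batch(labels, per_class, seed=0):
    """Draw a mini-batch of indices with `per_class` samples from every class.

    Sampling is with replacement, so tail classes can fill their quota.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    batch = []
    for label in sorted(by_class):
        batch.extend(rng.choices(by_class[label], k=per_class))
    rng.shuffle(batch)
    return batch


def inverse_frequency_weights(labels):
    """Per-class loss weights proportional to 1 / class frequency (assumed scheme)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


# Hypothetical long-tailed labels: 90 head, 9 middle, 1 tail sample.
labels = [0] * 90 + [1] * 9 + [2] * 1
batch = uniform_batch(labels, per_class=4)
print(Counter(labels[i] for i in batch))   # 4 samples from each class
print(inverse_frequency_weights(labels))   # the tail class gets the largest weight
```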

Class Imbalance Resolution Method and Classification Algorithm Suggesting Based on Dataset Type Segmentation

  • 김정훈;곽기영
    • Journal of Intelligence and Information Systems / Vol. 28, No. 3 / pp. 23-43 / 2022
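  • Interest in algorithm selection is growing as AI (Artificial Intelligence) is adopted across industries. Algorithm selection is most often determined by the experience of the data scientist. Less experienced data scientists, however, select algorithms through meta-learning based on dataset characteristics. Existing algorithm recommendation is a black box, so the grounds on which a recommendation was derived could not be known. This study therefore uses k-means clustering to divide datasets into types according to their characteristics and searches for suitable classification algorithms and class imbalance resolution methods. The study derived four types and recommended a suitable class imbalance resolution method and classification algorithm for each dataset type.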
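As a rough illustration of segmenting datasets by their characteristics with k-means, the sketch below clusters a few datasets described by two normalized meta-features. The meta-features, their values, the choice of `k=2`, and the deterministic initialization are all hypothetical. The study itself derived four types from its own feature set.

```python
import math


def kmeans(points, k, iters=20):
    """Plain k-means (Lloyd's algorithm) on a list of equal-length tuples.

    Initialization uses the first k points, an assumption made to keep the
    sketch deterministic; practical code would use a randomized scheme.
    """
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical meta-features per dataset: (minority-class share, scaled feature count).
datasets = [(0.10, 0.20), (0.45, 0.80), (0.15, 0.25), (0.40, 0.90)]
assignment, centroids = kmeans(datasets, k=2)
print(assignment)  # datasets with the same number share a type
```

Each resulting type could then be paired with the imbalance-handling method and classifier that performed best on its member datasets.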

KNN-Based Automatic Cropping for Improved Threat Object Recognition in X-Ray Security Images

  • Dumagpi, Joanna Kazzandra;Jung, Woo-Young;Jeong, Yong-Jin
    • Journal of IKEEE / Vol. 23, No. 4 / pp. 1134-1139 / 2019
  • One of the most important applications of computer vision algorithms is the detection of threat objects in x-ray security images. However, in the practical setting, this task is complicated by two properties inherent to the dataset, namely, the problem of class imbalance and visual complexity. In our previous work, we resolved the class imbalance problem by using a GAN-based anomaly detection to balance out the bias induced by training a classification model on a non-practical dataset. In this paper, we propose a new method to alleviate the visual complexity problem by using a KNN-based automatic cropping algorithm to remove distracting and irrelevant information from the x-ray images. We use the cropped images as inputs to our current model. Empirical results show substantial improvement to our model, e.g. about 3% in the practical dataset, thus further outperforming previous approaches, which is very critical for security-based applications.

Geometric and Semantic Improvement for Unbiased Scene Graph Generation

  • Ruhui Zhang;Pengcheng Xu;Kang Kang;You Yang
    • KSII Transactions on Internet and Information Systems (TIIS) / Vol. 17, No. 10 / pp. 2643-2657 / 2023
  • Scene graphs are structured representations that can clearly convey objects and the relationships between them, but are often heavily biased due to the highly skewed, long-tailed relational labeling in the dataset. Indeed, the visual world itself and its descriptions are biased. Therefore, Unbiased Scene Graph Generation (USGG) prefers to train models to eliminate long-tail effects as much as possible, rather than altering the dataset directly. To this end, we propose Geometric and Semantic Improvement (GSI) for USGG to mitigate this issue. First, to fully exploit the feature information in the images, geometric dimension and semantic dimension enhancement modules are designed. The geometric module is designed from the perspective that the position information between neighboring object pairs will affect each other, which can improve the recall rate of the overall relationship in the dataset. The semantic module further processes the embedded word vector, which can enhance the acquisition of semantic information. Then, to improve the recall rate of the tail data, the Class Balanced Seesaw Loss (CBSLoss) is designed for the tail data. The recall rate of the prediction is improved by penalizing the body or tail relations that are judged incorrectly in the dataset. The experimental findings demonstrate that the GSI method performs better than mainstream models in terms of the mean Recall@K (mR@K) metric in three tasks. The long-tailed imbalance in the Visual Genome 150 (VG150) dataset is addressed better using the GSI method than by most of the existing methods.
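The reason mean Recall@K is preferred over plain Recall@K on long-tailed relation labels can be shown with a simplified sketch: plain recall pools all relations, while mean recall averages per-predicate recall so that tail predicates count equally. The function name and the counts are hypothetical, and the full benchmark protocol (top-K ranking per image) is not reproduced here.

```python
from collections import defaultdict


def recall_and_mean_recall(gt_predicates, hits):
    """Compute pooled recall and per-predicate mean recall.

    gt_predicates[i] is the predicate class of ground-truth relation i;
    hits[i] is True if that relation was recovered among the top-K predictions.
    """
    total = defaultdict(int)
    found = defaultdict(int)
    for predicate, hit in zip(gt_predicates, hits):
        total[predicate] += 1
        found[predicate] += int(hit)
    recall = sum(found.values()) / sum(total.values())
    mean_recall = sum(found[p] / total[p] for p in total) / len(total)
    return recall, mean_recall


# Hypothetical counts: head predicate "on" (90 relations, 81 recovered),
# tail predicate "parked on" (10 relations, 2 recovered).
gt = ["on"] * 90 + ["parked on"] * 10
hits = [True] * 81 + [False] * 9 + [True] * 2 + [False] * 8
recall, mean_recall = recall_and_mean_recall(gt, hits)
print(recall, mean_recall)  # pooled recall hides the weak tail predicate
```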

Validation of Semantic Segmentation Dataset for Autonomous Driving

  • 곽석우;나호용;김경수;송은지;정세영;이계원;정지현;황성호
    • Journal of Drive and Control / Vol. 19, No. 4 / pp. 104-109 / 2022
  • For autonomous driving research using AI, datasets collected from road environments play an important role. In other countries, various datasets such as CityScapes, A2D2, and BDD have already been released, but datasets suitable for the domestic road environment still need to be provided. This paper analyzes and verifies a dataset reflecting the Korean driving environment. To verify the training dataset, the class imbalance was confirmed by comparing the number of pixels and instances per class. The same deep learning model, ConvNeXt, was also trained on the similar A2D2 dataset to compare against and verify the constructed dataset. Per-class IoU for the classes shared by the two datasets and the overall mIoU were compared. The paper confirms that the collected dataset reflecting the driving environment of Korea is suitable for learning.
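The per-class pixel counts, IoU, and mIoU used in this kind of validation can be computed as in the sketch below. The flattened label lists and class names are hypothetical, and real segmentation code would operate on image arrays rather than Python lists.

```python
from collections import Counter


def per_class_iou(y_true, y_pred):
    """Intersection over Union per class for flattened pixel labels."""
    ious = {}
    for cls in set(y_true) | set(y_pred):
        inter = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        union = sum(1 for t, p in zip(y_true, y_pred) if t == cls or p == cls)
        ious[cls] = inter / union
    return ious


def mean_iou(ious):
    """mIoU: unweighted mean of the per-class IoU values."""
    return sum(ious.values()) / len(ious)


# Hypothetical 10-pixel image: "road" dominates, "person" is rare.
y_true = ["road"] * 8 + ["person"] * 2
y_pred = ["road"] * 9 + ["person"] * 1
print(Counter(y_true))            # pixel counts expose the class imbalance
ious = per_class_iou(y_true, y_pred)
print(ious, mean_iou(ious))       # the rare class has the lower IoU
```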

Classification for Imbalanced Breast Cancer Dataset Using Resampling Methods

  • Hana Babiker, Nassar
    • International Journal of Computer Science & Network Security / Vol. 23, No. 1 / pp. 89-95 / 2023
  • Analyzing breast cancer patient files is becoming an exciting area of medical information analysis, especially with the increasing number of patient files. In this paper, breast cancer data is collected from Khartoum state hospital, and the dataset is classified into recurrence and no recurrence. The data is imbalanced, meaning that one of the two classes has more samples than the other. Several pre-processing techniques are applied to classify this imbalanced data (resampling, attribute selection, and handling missing values), and then different classifier models are built. In the first experiment, five classifiers (ANN, REP TREE, SVM, and J48) are used, and in the second experiment, meta-learning algorithms (Bagging, Boosting, and Random subspace) are used. Finally, the ensemble model is used. The best result was obtained from the ensemble model (Boosting with J48), with the highest accuracy of 95.2797% among all the algorithms, followed by Bagging with J48 (90.559%) and Random subspace with J48 (84.2657%). The imbalanced breast cancer dataset was classified into recurrence and no recurrence with different classification algorithms, and the best result was obtained from the ensemble model.

Default Prediction for Real Estate Companies with Imbalanced Dataset

  • Dong, Yuan-Xiang;Xiao, Zhi;Xiao, Xue
    • Journal of Information Processing Systems / Vol. 10, No. 2 / pp. 314-333 / 2014
  • When analyzing default predictions in real estate companies, the number of non-defaulted cases always greatly exceeds the defaulted ones, which creates the two-class imbalance problem. This lowers the ability of prediction models to distinguish the default sample. In order to avoid this sample selection bias and to improve the prediction model, this paper applies a minority sample generation approach to create new minority samples. The logistic regression, support vector machine (SVM) classification, and neural network (NN) classification use an imbalanced dataset. They were used as benchmarks with a single prediction model that used a balanced dataset corrected by the minority sample generation approach. Instead of using prediction-oriented tests and the overall accuracy, the true positive rate (TPR), the true negative rate (TNR), G-mean, and F-score are used to measure the performance of default prediction models on imbalanced datasets. In this paper, we describe an empirical experiment that used a sample of 14 default and 315 non-default listed real estate companies in China and report that most results using single prediction models with a balanced dataset were better than those with an imbalanced dataset.
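The four measures named in this abstract can be computed from confusion-matrix counts as sketched below. The class sizes (14 default, 315 non-default) come from the abstract. The prediction counts are hypothetical and only illustrate how overall accuracy can mislead on such data.

```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """Accuracy, TPR, TNR, G-mean and F-score, with 'default' as the positive class."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f_score = (2 * precision * tpr / (precision + tpr)) if (precision + tpr) else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "TPR": tpr,
        "TNR": tnr,
        "G-mean": math.sqrt(tpr * tnr),
        "F-score": f_score,
    }


# Hypothetical model A: predicts "non-default" for all 329 companies.
print(imbalance_metrics(tp=0, fp=0, tn=315, fn=14))
# Hypothetical model B: finds 7 of the 14 defaults at the cost of 15 false alarms.
print(imbalance_metrics(tp=7, fp=15, tn=300, fn=7))
```

Model A has the higher accuracy but a G-mean and F-score of zero, which is why the paper evaluates with TPR, TNR, G-mean, and F-score rather than overall accuracy.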

Boosting the Performance of the Predictive Model on the Imbalanced Dataset Using SVM Based Bagging and Out-of-Distribution Detection

  • 김종훈;오하영
    • KIPS Transactions on Software and Data Engineering / Vol. 11, No. 11 / pp. 455-464 / 2022
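  • Datasets generated by manufacturing processes have two main characteristics: severe imbalance in the target class and the continual occurrence of Out-of-Distribution (OoD) samples. Class imbalance can be handled with SMOTE and various sampling strategies. OoD detection, however, has so far been addressed only in the domain of artificial neural networks. Neural networks, to which OoD detection can be applied, do not achieve satisfactory performance on manufacturing process datasets. The reason is that manufacturing process datasets are very small and very noisy compared with the image and text datasets that neural networks typically handle. Overfitting of neural networks is also cited as a cause of their degraded performance on manufacturing datasets. We therefore attempted to combine the SVM algorithm with OoD detection, which has not been tried before. We also incorporated the Bagging algorithm into the modeling to improve the precision of the predictive model.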