• Title/Summary/Keyword: class imbalance

Search Result 119, Processing Time 0.026 seconds

Improved Focused Sampling for Class Imbalance Problem (클래스 불균형 문제를 해결하기 위한 개선된 집중 샘플링)

  • Kim, Man-Sun;Yang, Hyung-Jeong;Kim, Soo-Hyung;Cheah, Wooi Ping
    • The KIPS Transactions:PartB
    • /
    • v.14B no.4
    • /
    • pp.287-294
    • /
    • 2007
  • Many classification algorithms for real world data suffer from a data class imbalance problem. To solve this problem, various methods have been proposed such as altering the training balance and designing better sampling strategies. The previous methods are not satisfy in the distribution of the input data and the constraint. In this paper, we propose a focused sampling method which is more superior than previous methods. To solve the problem, we must select some useful data set from all training sets. To get useful data set, the proposed method devide the region according to scores which are computed based on the distribution of SOM over the input data. The scores are sorted in ascending order. They represent the distribution or the input data, which may in turn represent the characteristics or the whole data. A new training dataset is obtained by eliminating unuseful data which are located in the region between an upper bound and a lower bound. The proposed method gives a better or at least similar performance compare to classification accuracy of previous approaches. Besides, it also gives several benefits : ratio reduction of class imbalance; size reduction of training sets; prevention of over-fitting. The proposed method has been tested with kNN classifier. An experimental result in ecoli data set shows that this method achieves the precision up to 2.27 times than the other methods.

Development of Evaluation Metrics that Consider Data Imbalance between Classes in Facies Classification (지도학습 기반 암상 분류 시 클래스 간 자료 불균형을 고려한 평가지표 개발)

  • Kim, Dowan;Choi, Junhwan;Byun, Joongmoo
    • Geophysics and Geophysical Exploration
    • /
    • v.23 no.3
    • /
    • pp.131-140
    • /
    • 2020
  • In training a classification model using machine learning, the acquisition of training data is a very important stage, because the amount and quality of the training data greatly influence the model performance. However, when the cost of obtaining data is so high that it is difficult to build ideal training data, the number of samples for each class may be acquired very differently, and a serious data-imbalance problem can occur. If such a problem occurs in the training data, all classes are not trained equally, and classes containing relatively few data will have significantly lower recall values. Additionally, the reliability of evaluation indices such as accuracy and precision will be reduced. Therefore, this study sought to overcome the problem of data imbalance in two stages. First, we introduced weighted accuracy and weighted precision as new evaluation indices that can take into account a data-imbalance ratio by modifying conventional measures of accuracy and precision. Next, oversampling was performed to balance weighted precision and recall among classes. We verified the algorithm by applying it to the problem of facies classification. As a result, the imbalance between majority and minority classes was greatly mitigated, and the boundaries between classes could be more clearly identified.

SVM Ensemble Techniques for Class Imbalance Problem (데이터 불균형 문제에서의 SVM 앙상블 기법의 적용)

  • 강필성;이형주;조성준
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2004.10b
    • /
    • pp.706-708
    • /
    • 2004
  • 대부분의 기계학습 알고리즘은 학습 데이터에서 각각의 범주간의 비율이 동일하거나 비슷하다는 가정 하에 문제를 풀게 된다. 그러나 실제 문제에서는 그 비율이 동일하지 않으며 매우 큰 차이를 보이기도 하는데, 이는 분류 성능을 저하시키는 요인이기도 하다 따라서 본 논문에서는 이러한 데이터의 불균형 문제를 해소하는 방안으로 SVM 앙상블 기법을 적용한 샘플링을 제안하고 이를 실제 불균형 데이터에 적용함으로써 제안된 방법이 기존의 방법들에 비해 향상된 성능을 나타내는 것을 보였다.

  • PDF

Prediction Model for Gastric Cancer via Class Balancing Techniques

  • Danish, Jamil ;Sellappan, Palaniappan;Sanjoy Kumar, Debnath;Muhammad, Naseem;Susama, Bagchi ;Asiah, Lokman
    • International Journal of Computer Science & Network Security
    • /
    • v.23 no.1
    • /
    • pp.53-63
    • /
    • 2023
  • Many researchers are trying hard to minimize the incidence of cancers, mainly Gastric Cancer (GC). For GC, the five-year survival rate is generally 5-25%, but for Early Gastric Cancer (EGC), it is almost 90%. Predicting the onset of stomach cancer based on risk factors will allow for an early diagnosis and more effective treatment. Although there are several models for predicting stomach cancer, most of these models are based on unbalanced datasets, which favours the majority class. However, it is imperative to correctly identify cancer patients who are in the minority class. This research aims to apply three class-balancing approaches to the NHS dataset before developing supervised learning strategies: Oversampling (Synthetic Minority Oversampling Technique or SMOTE), Undersampling (SpreadSubsample), and Hybrid System (SMOTE + SpreadSubsample). This study uses Naive Bayes, Bayesian Network, Random Forest, and Decision Tree (C4.5) methods. We measured these classifiers' efficacy using their Receiver Operating Characteristics (ROC) curves, sensitivity, and specificity. The validation data was used to test several ways of balancing the classifiers. The final prediction model was built on the one that did the best overall.

Topic Classification for Suicidology

  • Read, Jonathon;Velldal, Erik;Ovrelid, Lilja
    • Journal of Computing Science and Engineering
    • /
    • v.6 no.2
    • /
    • pp.143-150
    • /
    • 2012
  • Computational techniques for topic classification can support qualitative research by automatically applying labels in preparation for qualitative analyses. This paper presents an evaluation of supervised learning techniques applied to one such use case, namely, that of labeling emotions, instructions and information in suicide notes. We train a collection of one-versus-all binary support vector machine classifiers, using cost-sensitive learning to deal with class imbalance. The features investigated range from a simple bag-of-words and n-grams over stems, to information drawn from syntactic dependency analysis and WordNet synonym sets. The experimental results are complemented by an analysis of systematic errors in both the output of our system and the gold-standard annotations.

On the Efficiency of Imbalance in a Class of Manufacturing Systems (제조시스템에 있어서 불균형의 효율성)

  • 김성철
    • Journal of the Korean Operations Research and Management Science Society
    • /
    • v.21 no.3
    • /
    • pp.1-10
    • /
    • 1996
  • In this paper, the problem of simultaneously allocating servers and loadings of stations in a class of manufacturing systems modelled as network of queues is considered. The throughput function of the closed network of queues is demonstrated as a Schur convex function of server allocation, that is, increasing the server allocation vector under majorization increases the performance in the ship in terms of the throughput. It also reduces the congestion in the open network of queues in terms of reducing the total number of jobs in the sense of likelihood ratio ordering. These are the extentions of the numerical results of Green and Guha (1995) in the service system with independent M/M/c systems to the network of queues. The results can be used to support production planning in certain manufacturing systems.

  • PDF

Traffic Data Generation Technique for Improving Network Attack Detection Using Deep Learning (네트워크 공격 탐지 성능향상을 위한 딥러닝을 이용한 트래픽 데이터 생성 연구)

  • Lee, Wooho;Hahm, Jaegyoon;Jung, Hyun Mi;Jeong, Kimoon
    • Journal of the Korea Convergence Society
    • /
    • v.10 no.11
    • /
    • pp.1-7
    • /
    • 2019
  • Recently, various approaches to detect network attacks using machine learning have been studied and are being applied to detect new attacks and to increase precision. However, the machine learning method is dependent on feature extraction and takes a long time and complexity. It also has limitation of performace due to learning data imbalance. In this study, we propose a method to solve the degradation of classification performance due to imbalance of learning data among the limit points of detection system. To do this, we generate data using Generative Adversarial Networks (GANs) and propose a classification method using Convolutional Neural Networks (CNNs). Through this approach, we can confirm that the accuracy is improved when applied to the NSL-KDD and UNSW-NB15 datasets.

Skin Disease Classification Technique Based on Convolutional Neural Network Using Deep Metric Learning (Deep Metric Learning을 활용한 합성곱 신경망 기반의 피부질환 분류 기술)

  • Kim, Kang Min;Kim, Pan-Koo;Chun, Chanjun
    • Smart Media Journal
    • /
    • v.10 no.4
    • /
    • pp.45-54
    • /
    • 2021
  • The skin is the body's first line of defense against external infection. When a skin disease strikes, the skin's protective role is compromised, necessitating quick diagnosis and treatment. Recently, as artificial intelligence has advanced, research for technical applications has been done in a variety of sectors, including dermatology, to reduce the rate of misdiagnosis and obtain quick treatment using artificial intelligence. Although previous studies have diagnosed skin diseases with low incidence, this paper proposes a method to classify common illnesses such as warts and corns using a convolutional neural network. The data set used consists of 3 classes and 2,515 images, but there is a problem of lack of training data and class imbalance. We analyzed the performance using a deep metric loss function and a cross-entropy loss function to train the model. When comparing that in terms of accuracy, recall, F1 score, and accuracy, the former performed better.

A Study on the Classification of Fault Motors using Sound Data (소리 데이터를 이용한 불량 모터 분류에 관한 연구)

  • Il-Sik, Chang;Gooman, Park
    • Journal of Broadcast Engineering
    • /
    • v.27 no.6
    • /
    • pp.885-896
    • /
    • 2022
  • Motor failure in manufacturing plays an important role in future A/S and reliability. Motor failure is detected by measuring sound, current, and vibration. For the data used in this paper, the sound of the car's side mirror motor gear box was used. Motor sound consists of three classes. Sound data is input to the network model through a conversion process through MelSpectrogram. In this paper, various methods were applied, such as data augmentation to improve the performance of classifying fault motors and various methods according to class imbalance were applied resampling, reweighting adjustment, change of loss function and representation learning and classification into two stages. In addition, the curriculum learning method and self-space learning method were compared through a total of five network models such as Bidirectional LSTM Attention, Convolutional Recurrent Neural Network, Multi-Head Attention, Bidirectional Temporal Convolution Network, and Convolution Neural Network, and the optimal configuration was found for motor sound classification.

Study on Lifelog Anomaly Detection using VAE-based Machine Learning Model (VAE(Variational AutoEncoder) 기반 머신러닝 모델을 활용한 체중 라이프로그 이상탐지에 관한 연구)

  • Kim, Jiyong;Park, Minseo
    • The Journal of the Convergence on Culture Technology
    • /
    • v.8 no.4
    • /
    • pp.91-98
    • /
    • 2022
  • Lifelog data continuously collected through a wearable device may contain many outliers, so in order to improve data quality, it is necessary to find and remove outliers. In general, since the number of outliers is less than the number of normal data, a class imbalance problem occurs. To solve this imbalance problem, we propose a method that applies Variational AutoEncoder to outliers. After preprocessing the outlier data with proposed method, it is verified through a number of machine learning models(classification). As a result of verification using body weight data, it was confirmed that the performance was improved in all classification models. Based on the experimental results, when analyzing lifelog body weight data, we propose to apply the LightGBM model with the best performance after preprocessing the data using the outlier processing method proposed in this study.