• Title/Summary/Keyword: Imbalanced Data

Under Sampling for Imbalanced Data using Minor Class based SVM (MCSVM) in Semiconductor Process (MCSVM을 이용한 반도체 공정데이터의 과소 추출 기법)

  • Pak, Sae-Rom;Kim, Jun Seok;Park, Cheong-Sool;Park, Seung Hwan;Baek, Jun-Geol
    • Journal of Korean Institute of Industrial Engineers, v.40 no.4, pp.404-414, 2014
  • Yield prediction is important for managing semiconductor quality. Many studies have applied machine learning algorithms such as the support vector machine (SVM) to predict yield precisely. However, yield prediction with SVM is difficult because the final test procedure in semiconductor manufacturing generates large and extremely imbalanced data. Applying SVM to imbalanced data sometimes produces unnecessary support vectors from the major class because support vectors from the minor class go unselected, so the decision boundary at the target class can be overwhelmed by observations in the major class. For this reason, we propose an under-sampling method with minor class based SVM (MCSVM) that overcomes the limitations of the ordinary SVM algorithm. MCSVM constructs a model that fixes some of the minor-class data as support vectors, and these can be good samples representing the nature of the target class. Experimental studies using data sets from UCI and a real manufacturing process show that the proposed method performs better than existing sampling methods.

Classification of Imbalanced Data Based on MTS-CBPSO Method: A Case Study of Financial Distress Prediction

  • Gu, Yuping;Cheng, Longsheng;Chang, Zhipeng
    • Journal of Information Processing Systems, v.15 no.3, pp.682-693, 2019
  • Traditional classification methods mostly assume that the class distribution of the data is balanced, while imbalanced data is widely found in the real world, so it is important to solve the problem of classification with imbalanced data. In the Mahalanobis-Taguchi system (MTS) algorithm, the classification model is constructed from a reference space and a measurement reference scale that come from a single normal group, and thus it is suitable for handling the imbalanced data problem. In this paper, an improved method, MTS-CBPSO, is constructed by introducing chaotic mapping and the binary particle swarm optimization algorithm in place of the orthogonal array and signal-to-noise ratio (SNR) to select the valid variables, with G-means, F-measure, and dimensionality reduction as the classification optimization targets. The proposed method is applied to the financial distress prediction of Chinese listed companies. Compared with the traditional MTS and common classification methods such as SVM, C4.5, and k-NN, the MTS-CBPSO method achieves better prediction accuracy and dimensionality reduction.
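
The abstract above names G-means and F-measure as optimization targets. As a minimal sketch of how these two metrics are commonly computed from a confusion matrix (the function names and the toy labels are illustrative assumptions, not taken from the paper):

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fn, fp, tn), treating `positive` as the minority class."""
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


def f_measure(y_true, y_pred, positive=1, beta=1.0):
    """F-beta score of the positive class (beta=1 gives F1)."""
    tp, fn, fp, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy example: 2 positives among 10 observations.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(g_mean(y_true, y_pred), f_measure(y_true, y_pred))
```

On this toy data plain accuracy is 0.8 even though only half of the minority class is recovered, which is why metrics of this kind are usually preferred for imbalanced problems.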

Classification of Class-Imbalanced Data: Effect of Over-sampling and Under-sampling of Training Data (계급불균형자료의 분류: 훈련표본 구성방법에 따른 효과)

  • 김지현;정종빈
    • The Korean Journal of Applied Statistics, v.17 no.3, pp.445-457, 2004
  • Given class-imbalanced data in a two-class classification problem, we often over-sample and/or under-sample the training data to balance it. We investigate the validity of this practice and study its effect on boosting of classification trees. Experiments on twelve real datasets show that keeping the natural distribution of the training data is the best approach if you plan to apply boosting methods to class-imbalanced data.
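
For readers unfamiliar with the two sampling practices examined above, the following is a minimal sketch of plain random under-sampling and random over-sampling. The helper names, the seed parameter, and the toy data are assumptions for illustration; the paper's exact sampling procedure is not described in the abstract.

```python
import random
from collections import Counter


def _group_by_class(X, y):
    """Collect the rows of X under their class label."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return groups


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    n_min = min(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        for row in rng.sample(rows, n_min):  # sampling without replacement
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


def random_oversample(X, y, seed=0):
    """Duplicate random rows of each class until it matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    n_max = max(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        extra = [rng.choice(rows) for _ in range(n_max - len(rows))]
        for row in rows + extra:
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


# Toy data: 8 majority rows (class 0) and 2 minority rows (class 1).
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2
X_under, y_under = random_undersample(X, y)
X_over, y_over = random_oversample(X, y)
print(Counter(y_under), Counter(y_over))
```

Under-sampling discards majority-class information and over-sampling repeats minority rows, which is one reason a study like the one above can find that the natural distribution works better for boosting.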

Oversampling-Based Ensemble Learning Methods for Imbalanced Data (불균형 데이터 처리를 위한 과표본화 기반 앙상블 학습 기법)

  • Kim, Kyung-Min;Jang, Ha-Young;Zhang, Byoung-Tak
    • KIISE Transactions on Computing Practices, v.20 no.10, pp.549-554, 2014
  • Handwritten character recognition data is usually imbalanced because it is collected from natural language sentences written by different writers. Imbalanced data can seriously degrade the performance of most machine learning algorithms. This problem is typically ignored in handwritten character recognition, however, because most of the difficulty is attributed to the high variance in the data set and to similar shapes between characters. We propose oversampling-based ensemble learning methods to solve the imbalanced data problem in handwritten character recognition and to improve recognition accuracy. We also show empirically that the proposed method improves the recognition accuracy of minor classes as well as the overall recognition accuracy.

Ergonomic Evaluation of Workload in Imbalanced Lower Limbs Postures

  • Kim, Eun-Sik;Yoon, Hoon-Yong
    • Journal of the Ergonomics Society of Korea, v.30 no.5, pp.671-681, 2011
  • Objective: The purpose of this study is to compare the workload level of each lower limb posture and to suggest an ergonomic workstation guideline for working periods by evaluating imbalanced lower limb postures from the physiological and psychophysical points of view. Background: Many workers, such as welders, work in various imbalanced lower limb postures because of narrow working conditions or other environmental conditions. Method: Ten male subjects participated in this experiment. Subjects were asked to maintain 3 different lower limb postures (standing, squatting, and bending) under 3 different working conditions (balanced floor with no scaffold, imbalanced floor with a 10 cm scaffold, and imbalanced floor with a 20 cm scaffold). EMG data for 4 muscle groups (Rectus Femoris, Vastus Lateralis, Tibialis Anterior, Gastrocnemius) were collected for each lower limb posture for 20 seconds every 2 minutes during the 8-minute sustained task. Subjects were also asked to report discomfort ratings for body parts such as the waist, upper legs, lower legs, and ankles. Results: The ANOVA results showed that the EMG root mean square (RMS) values and the discomfort ratings (CR-10 Rating Scale) were significantly affected by lower limb posture and working time (p<0.05). The correlation between the EMG data and the discomfort ratings was analyzed. Prediction models for the discomfort rating of each posture were also developed using physical condition, working time, and scaffold height. Conclusion: We strongly recommend that one should not work for more than 6 minutes in a standing or squatting posture, or for more than 4 minutes in a bending posture. Application: The results of this study could be used to design and assess working environments and methods. Furthermore, they could be used to suggest ergonomic guidelines for lower limb postures such as squatting and bending in the workplace in order to prevent fatigue and pain in the lower limbs.

Imbalanced sample fault diagnosis method for rotating machinery in nuclear power plants based on deep convolutional conditional generative adversarial network

  • Zhichao Wang;Hong Xia;Jiyu Zhang;Bo Yang;Wenzhe Yin
    • Nuclear Engineering and Technology, v.55 no.6, pp.2096-2106, 2023
  • Rotating machinery is widely used in important equipment of nuclear power plants (NPPs), such as pumps and valves. Research on intelligent fault diagnosis of rotating machinery is crucial to ensuring the safe operation of related equipment in NPPs. However, in practical applications, data-driven fault diagnosis faces the problem of small and imbalanced samples, resulting in low model training efficiency and poor generalization performance. Therefore, a deep convolutional conditional generative adversarial network (DCCGAN) is constructed to mitigate the impact of imbalanced samples on fault diagnosis. First, a conditional generative adversarial model is designed based on convolutional neural networks to effectively augment imbalanced samples. The model extracts the original sample features effectively by means of the conditional generative adversarial strategy and an appropriate number of filters. In addition, the quality of the generated samples is ensured by visualizing the model training process and the sample features. Then, a deep convolutional neural network (DCNN) is designed to extract features of the mixed samples and implement intelligent fault diagnosis. Finally, the performance of the DCCGAN model for data augmentation and intelligent fault diagnosis is verified on multi-fault experimental data from a motor and a bearing. The proposed method effectively alleviates the problem of imbalanced samples and shows its application value for intelligent fault diagnosis in actual NPPs.

Parameter estimation for the imbalanced credit scoring data using AUC maximization (AUC 최적화를 이용한 낮은 부도율 자료의 모수추정)

  • Hong, C.S.;Won, C.H.
    • The Korean Journal of Applied Statistics, v.29 no.2, pp.309-319, 2016
  • For binary classification models, we consider a risk score that is a function of linear scores and estimate the coefficients of the linear scores. There are two estimation methods: one obtains MLEs using logistic models and the other estimates the coefficients by maximizing AUC. Estimates from the AUC approach are better than MLEs from logistic models in general situations where the logistic assumptions do not hold. This paper considers imbalanced data for credit assessment models, in which the default class contains fewer observations than the non-default class, and applies the AUC approach to such data. Various logit link functions are used to generate imbalanced data. It is found that the coefficients estimated by the AUC approach are equivalent to (or better than) those from logistic models for imbalanced data with low default probability.
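
To make the objective in the abstract above concrete, here is a minimal sketch of the empirical AUC of a linear risk score, computed as the fraction of (default, non-default) pairs that the score ranks correctly, with ties counted as one half. The data, the candidate coefficient vectors, and the function names are hypothetical; the paper's actual optimization procedure is not given in the abstract.

```python
def linear_score(beta, x):
    """Linear risk score: the inner product of coefficients and features."""
    return sum(b * v for b, v in zip(beta, x))


def empirical_auc(pos_scores, neg_scores):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for sp in pos_scores:
        for sn in neg_scores:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def auc_for_coefficients(beta, X, y, positive=1):
    """AUC obtained when rows of X are scored with coefficients beta."""
    pos = [linear_score(beta, x) for x, label in zip(X, y) if label == positive]
    neg = [linear_score(beta, x) for x, label in zip(X, y) if label != positive]
    return empirical_auc(pos, neg)


# Hypothetical credit data: 2 defaults (label 1) among 6 applicants.
X = [(2.0, 1.0), (1.0, 0.5), (0.2, 0.1), (0.5, 0.9), (0.3, 0.2), (0.1, 0.4)]
y = [1, 1, 0, 0, 0, 0]

# Compare two candidate coefficient vectors by the AUC they achieve.
for beta in [(1.0, 0.0), (0.0, 1.0)]:
    print(beta, auc_for_coefficients(beta, X, y))
```

An AUC-maximizing estimator would search over the coefficients for the vector with the highest value of this criterion. Because AUC depends only on the ranking of scores, it is unaffected by the share of defaults in the sample.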

An Efficient One Class Classifier Using Gaussian-based Hyper-Rectangle Generation (가우시안 기반 Hyper-Rectangle 생성을 이용한 효율적 단일 분류기)

  • Kim, Do Gyun;Choi, Jin Young;Ko, Jeonghan
    • Journal of Korean Society of Industrial and Systems Engineering, v.41 no.2, pp.56-64, 2018
  • In recent years, imbalanced data has become one of the most important and frequent issues for quality control in industry. For example, defect rates have been drastically reduced thanks to highly developed technology and quality management, so only a few defective observations can be obtained from the production process. Quality classification must therefore be performed under the condition that one class (the defective dataset) is much smaller than the other class (the good dataset). However, traditional multi-class classification methods are not appropriate for such imbalanced datasets, since they classify data based on the differences between one class and the others, which can hardly be found in imbalanced datasets. One-class classification, which thoroughly learns the patterns of a target class, is more suitable for imbalanced datasets since it focuses only on data in the target class. So far, several one-class classification methods, such as the one-class support vector machine, neural network, and decision tree, have been suggested. The one-class support vector machine and neural network can guarantee a good classification rate, and the decision tree can provide a set of rules that can be clearly interpreted. However, the classifiers obtained from the former two methods consist of complex mathematical functions and cannot be easily understood by users, while for the decision tree the criterion for rule generation is ambiguous. As an alternative, a one-class classifier using hyper-rectangles was proposed, which classifies precisely compared to other methods and also generates rules that users can clearly understand. In this paper, we suggest an approach for overcoming the limitations of these previous one-class classification algorithms. Specifically, the suggested approach produces an improved one-class classifier using hyper-rectangles generated by a Gaussian function. The performance of the suggested algorithm is verified by a numerical experiment that uses several datasets from the UCI machine learning repository.
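
The abstract above does not specify how the hyper-rectangles are generated, so the following is only a simplified sketch of the general idea under a strong assumption: a single axis-aligned rectangle is fitted to the target class, with each side spanning the feature mean plus or minus k standard deviations. The function names, the parameter k, and the data are all hypothetical and should not be read as the authors' algorithm.

```python
import statistics


def fit_hyper_rectangle(target_rows, k=2.0):
    """Fit one interval per feature: mean +/- k sample standard deviations."""
    bounds = []
    for column in zip(*target_rows):  # iterate feature by feature
        mu = statistics.mean(column)
        sigma = statistics.stdev(column)
        bounds.append((mu - k * sigma, mu + k * sigma))
    return bounds


def is_target(bounds, row):
    """A row belongs to the target class if every feature is inside its interval."""
    return all(lo <= value <= hi for (lo, hi), value in zip(bounds, row))


# Hypothetical measurements of good (target-class) products only.
good = [(10.0, 5.0), (10.2, 5.1), (9.8, 4.9), (10.1, 5.0), (9.9, 5.0)]
bounds = fit_hyper_rectangle(good)

print(is_target(bounds, (10.0, 5.0)))  # a typical good product
print(is_target(bounds, (12.0, 5.0)))  # first feature far outside the interval
```

Each interval reads directly as a rule of the form "lower bound <= feature <= upper bound", which illustrates the interpretability the abstract attributes to hyper-rectangle classifiers.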

A Deep Learning Based Over-Sampling Scheme for Imbalanced Data Classification (불균형 데이터 분류를 위한 딥러닝 기반 오버샘플링 기법)

  • Son, Min Jae;Jung, Seung Won;Hwang, Een Jun
    • KIPS Transactions on Software and Data Engineering, v.8 no.7, pp.311-316, 2019
  • A classification problem is to predict the class to which an input belongs. One of the most popular ways to do this is to train a machine learning algorithm on a given dataset. For the best performance, the dataset should have a well-balanced class distribution; when the class distribution is imbalanced, classification performance can be very poor. To overcome this problem, we propose an over-sampling scheme that balances the number of data per class by using Conditional Generative Adversarial Networks (CGAN). CGAN is a generative model derived from Generative Adversarial Networks (GAN) that can learn data characteristics and generate data similar to real data. CGAN can therefore generate data for a class with few samples, so that the problem induced by the imbalanced class distribution is mitigated and classification performance is improved. Experiments using actual collected data show that the over-sampling technique using CGAN is effective and superior to existing over-sampling techniques.

Response Modeling for the Marketing Promotion with Weighted Case Based Reasoning Under Imbalanced Data Distribution (불균형 데이터 환경에서 변수가중치를 적용한 사례기반추론 기반의 고객반응 예측)

  • Kim, Eunmi;Hong, Taeho
    • Journal of Intelligence and Information Systems, v.21 no.1, pp.29-45, 2015
  • Response modeling is a well-known research issue for those who seek better performance in predicting customers' responses to marketing promotions. A response model reduces marketing cost by identifying prospective customers in a very large customer database and predicting the purchasing intention of the selected customers, whereas a promotion derived from an undifferentiated marketing strategy incurs unnecessary cost. In addition, the big data environment has accelerated the development of response models with data mining techniques such as CBR, neural networks, and support vector machines. CBR is one of the major tools in business because it is known to be simple and robust when applied to response models, and it remains an attractive technique for business data mining applications even though it has not shown high performance compared to other machine learning techniques. Thus many studies have tried to improve CBR and use it in business data mining with enhanced algorithms or with the support of other techniques such as genetic algorithms, decision trees, and AHP (Analytic Hierarchy Process). Ahn and Kim (2008) used logit, neural networks, and CBR to predict which customers would purchase the items promoted by the marketing department, and optimized the number k of nearest neighbors with a genetic algorithm to improve the performance of the integrated model. Hong and Park (2009) noted that an integrated approach combining CBR with logit, neural networks, and Support Vector Machine (SVM) predicted customers' responses to marketing promotion better than the individual data mining models. This paper presents an approach to predicting customers' responses to marketing promotion with Case Based Reasoning. The proposed model was developed by applying different weights to each feature. We fitted a logit model to a database containing promotion and purchasing data for bath soap, and then used its coefficients as the feature weights of CBR. We empirically compared the proposed weighted CBR model with neural networks and a pure CBR model and found that the weighted CBR model outperformed the pure CBR model. Imbalanced data is a common problem when building data mining models to classify real data in tasks such as bankruptcy prediction, intrusion detection, fraud detection, churn management, and response modeling. Imbalanced data means that the number of instances in one class is remarkably small or large compared to the number of instances in the other classes. A classification model such as a response model has great difficulty recognizing patterns through learning because it tends to ignore the small classes while classifying the large classes correctly. Sampling is one of the most representative approaches to resolving the problem caused by imbalanced data distribution, and it can be categorized into under-sampling and over-sampling. However, CBR is not sensitive to data distribution because, unlike machine learning algorithms, it does not learn from data. In this study, we investigated the robustness of the proposed model while changing the ratio of response customers to nonresponse customers, because in the real world the customers who respond to a promotion are always a small fraction compared with those who do not. We simulated the proposed model 100 times to validate its robustness with different ratios of response customers to nonresponse customers under imbalanced data distribution. Finally, we found that the proposed CBR based model outperformed the compared models on the imbalanced data sets. Our study is expected to improve the performance of response models for promotion programs with CBR under imbalanced data distribution in the real world.
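
The weighting idea described above can be sketched as a k-nearest-neighbor retrieval with a feature-weighted distance. This is a minimal illustration under assumptions: the weights stand in for the absolute values of logit coefficients, and the case base, the weights, and the function names are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter


def weighted_distance(a, b, weights):
    """Euclidean distance in which each feature is scaled by its weight."""
    return math.sqrt(sum(w * (x - z) ** 2 for w, x, z in zip(weights, a, b)))


def predict_response(case_base, labels, query, weights, k=3):
    """Retrieve the k most similar past cases and return their majority label."""
    ranked = sorted(
        zip(case_base, labels),
        key=lambda case: weighted_distance(case[0], query, weights),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical case base: 2 responders (1) and 3 non-responders (0).
cases = [(1.0, 0.0), (0.9, 1.0), (0.0, 0.1), (0.1, 0.9), (0.2, 0.5)]
labels = [1, 1, 0, 0, 0]
query = (0.8, 0.9)

# Assumed weights, e.g. absolute logit coefficients: feature 0 matters most.
print(predict_response(cases, labels, query, weights=(2.0, 0.1)))
# Equal weights correspond to a pure (unweighted) CBR baseline.
print(predict_response(cases, labels, query, weights=(1.0, 1.0)))
```

With the assumed weights the query's nearest cases are dominated by responders, while equal weights let the less informative second feature pull in non-responders, which illustrates why feature weighting can change the prediction.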